Report For Data Mining Project
Xiumo Zhan xiumoz@sfu.ca
Bowen Sun bsa58@sfu.ca
Abstract
This project uses MapReduce programming to find all frequent itemsets among the transactions in a given file in two passes. We use Java as the programming language and Eclipse with the Hadoop plugin as the development environment. We implement MapReduce in two passes with the SON algorithm and the Apriori algorithm. Given the input file, a parameter k as the number of subfiles, and a parameter s as the support threshold, our MapReduce program achieves the expected result.
Implementation
Our implementation is based on the SON algorithm. This algorithm consists of two passes, each of which requires one Map function and one Reduce function. The SON algorithm lends itself well to a parallel-computing environment: each of the chunks can be processed in parallel, and the frequent itemsets found in each chunk are combined to form the candidates. To simulate the parallel computing environment, we build a pseudo-distributed Hadoop environment on an Ubuntu system running as a virtual machine in VMware Workstation on our own laptop.
Pass 1
In Pass 1, we first divide the entire file into k subfiles, and the input of each mapper is one of the k subfiles. Then we run the Apriori algorithm on each subfile. While applying the Apriori algorithm, we read the whole input of a mapper and split it into lines, which represent the baskets. We then split each line into items and use a List<String[]> to store the distinct items, which form the candidate frequent 1-itemsets C1. Next we compute the support of each item in C1 to generate L1, and use L1 to form the candidate pairs C2. For each pair in C2, we check whether its support reaches the threshold s and thereby generate L2.
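To illustrate these first steps, the following is a minimal standalone Java sketch of the C1 -> L1 -> C2 chain on a toy set of baskets; the class, helper and toy data here are only illustrative and are not the methods used in our actual implementation (see the Appendix).

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class AprioriFirstStepsSketch {
    // Fraction of baskets that contain every item of the candidate itemset.
    static double support(List<String[]> baskets, Set<String> candidate) {
        int hits = 0;
        for (String[] basket : baskets) {
            if (new HashSet<String>(Arrays.asList(basket)).containsAll(candidate)) {
                hits++;
            }
        }
        return hits * 1.0 / baskets.size();
    }

    public static void main(String[] args) {
        double s = 0.5; // support threshold (toy value)
        List<String[]> baskets = Arrays.asList( // toy transactions
                new String[] { "1", "2", "3" },
                new String[] { "2", "3", "4" },
                new String[] { "2", "4" });

        // C1: all distinct items; L1: the items whose support reaches s.
        Set<String> c1 = new TreeSet<String>();
        for (String[] b : baskets) {
            c1.addAll(Arrays.asList(b));
        }
        List<String> l1 = new ArrayList<String>();
        for (String item : c1) {
            if (support(baskets, Collections.singleton(item)) >= s) {
                l1.add(item);
            }
        }

        // C2: all pairs of items from L1; L2: the pairs whose support reaches s.
        for (int i = 0; i < l1.size(); i++) {
            for (int j = i + 1; j < l1.size(); j++) {
                Set<String> pair = new TreeSet<String>(
                        Arrays.asList(l1.get(i), l1.get(j)));
                if (support(baskets, pair) >= s) {
                    System.out.println(pair + " is in L2");
                }
            }
        }
    }
}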
For any k >= 3, to self-join Lk-1 and form Ck we have to compare the first k - 2 elements of each pair of itemsets in Lk-1. For instance, if there are two 3-itemsets "234" and "235" in L3, we check whether the first two elements of these two itemsets are the same. The pseudo code of this procedure is as follows:
Combine(itemset1, itemset2)
    set point = 0;
    set key = {};
    for i = 1 to (length of both itemsets - 1)
        if itemset1[i] == itemset2[i]
            point = point + 1;
            key = key + itemset1[i];
        else
            break;
        endif
    endfor
    if point == (length of both itemsets - 1)
        if itemset1[point+1] > itemset2[point+1]
            key = key + itemset2[point+1] + itemset1[point+1];
        else
            key = key + itemset1[point+1] + itemset2[point+1];
        endif
    endif
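As a concrete illustration of this joining rule, the following is a minimal standalone Java sketch of Combine for two sorted (k-1)-itemsets stored as string arrays; it is a simplified illustration, not the combine() method shown in the Appendix.

import java.util.Arrays;

public class CombineSketch {
    // Join two sorted (k-1)-itemsets whose first k-2 items agree into one
    // k-itemset; return null if they cannot be joined.
    static String[] combine(String[] a, String[] b) {
        int n = a.length; // both itemsets have length k-1
        for (int i = 0; i < n - 1; i++) {
            if (!a[i].equals(b[i])) {
                return null; // the first k-2 items differ: no join
            }
        }
        if (a[n - 1].equals(b[n - 1])) {
            return null; // identical itemsets
        }
        String[] joined = new String[n + 1];
        System.arraycopy(a, 0, joined, 0, n - 1);
        // Append the two distinct last items in ascending numeric order.
        int x = Integer.parseInt(a[n - 1]);
        int y = Integer.parseInt(b[n - 1]);
        joined[n - 1] = Integer.toString(Math.min(x, y));
        joined[n] = Integer.toString(Math.max(x, y));
        return joined;
    }

    public static void main(String[] args) {
        // "234" and "235" join to "2345"; "234" and "245" do not join.
        System.out.println(Arrays.toString(
                combine(new String[] { "2", "3", "4" }, new String[] { "2", "3", "5" })));
        System.out.println(Arrays.toString(
                combine(new String[] { "2", "3", "4" }, new String[] { "2", "4", "5" })));
    }
}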
When we use "234" and "235" to form "2345", we still have to check whether it qualifies to stay in C4. Since "234" and "235" are already known to be in L3, we only need to check whether "245" and "345" are in L3, instead of checking all four 3-subsets, which avoids unnecessary checks. In practical programming, we notice that for a candidate in Ck the subsets that need to be checked are those that contain the last two items and omit one of the first k - 2 items. We continue the self-join process using Lk until the generated Ck+1 is empty. The pseudo code of our checking procedure can be written as follows:
Check(itemset, Lk-1)
    set flag = 1;
    for i = 1 to (length of itemset - 2)
        set subitemset[i] = delete itemset[i] from the itemset;
        if subitemset[i] does not exist in Lk-1
            set flag = 0;
            break;
        endif
    endfor
    if flag == 0
        delete this itemset from Ck;
    else
        keep this itemset in Ck;
    endif
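A minimal standalone Java sketch of this check is given below, assuming a candidate is a sorted list of item strings and Lk-1 is a set of comma-joined keys (the same comma convention as STRING_SPLIT in the Appendix); the class and method names are illustrative only.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CandidateCheckSketch {
    // Keep a k-itemset candidate only if every (k-1)-subset obtained by
    // dropping one of its first k-2 items is present in Lk-1.
    static boolean survives(List<String> candidate, Set<String> lkMinus1) {
        int k = candidate.size();
        for (int i = 0; i < k - 2; i++) {
            List<String> sub = new ArrayList<String>(candidate);
            sub.remove(i);
            if (!lkMinus1.contains(String.join(",", sub))) {
                return false; // an infrequent subset: prune the candidate
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> l3 = new HashSet<String>(
                Arrays.asList("2,3,4", "2,3,5", "2,4,5", "3,4,5"));
        // "2,3,4,5" survives only because "3,4,5" and "2,4,5" are both in L3.
        System.out.println(survives(Arrays.asList("2", "3", "4", "5"), l3));
    }
}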
The result produced by each mapper is the set of candidate frequent itemsets of its subfile. The reducer then uses our first Reduce function to prune the duplicated itemsets in the output of the mappers. Note that the first Reduce function ignores the support value of each itemset; the computation of the actual support of each itemset is assigned to Pass 2.
After producing all the candidate itemsets with the Apriori algorithm in Pass 1, we output a file of all the candidate itemsets in the format <"itemset", "value">, where "value" is set to 1, since we only need to collect the distinct candidate frequent itemsets. Storing all candidate itemsets is necessary for the algorithm to carry on, because it ensures that the candidate itemsets produced in Pass 1 are passed to the next pass.
The pseudo code of the whole procedure of Pass 1 can be represented as follows:
Class FirstMapper
    Method FirstMapper(inputfile)
        set i = 1;
        while (|Ci| > 0)
            for each Cik in Ci
                Cik = <Cik.key, Cik.support = computesupport(inputfile, Cik)>;
            endfor
            Li = cut(Ci);
            Result = Result + Li;
            Ci+1 = self-join(Li);
            i++;
        endwhile
        Output <Result.key, 1>;

Class FirstReducer
    Method FirstReducer(keys, values)
        Collect all distinct candidate frequent itemsets;
Pass 2
In Pass 2, we first read the output file produced in Pass 1. The task of the mapper in Pass 2 is to count the number of appearances of the candidate frequent itemsets in each subfile.
To do this, we reuse the subfiles and implement the second Map and Reduce functions. The second mapper counts the appearances of each candidate frequent itemset in its subfile and transmits the result in the format <"key", "value"> to the second reducer.
The second reducer sums the values for the same key and generates a new pair for each candidate frequent itemset. The reducer then eliminates the itemsets whose support, sum(values) / (total number of baskets), is smaller than s. The pseudo code can be written as follows:
Class SecondMapper
    Method SecondMapper(result of Pass 1, subfile)
        Result2 = {};
        for each key[i] in the result of Pass 1
            count = number of baskets in the subfile in which key[i] appears;
            Result2 = Result2 + <key[i], count>;
        endfor
        Return Result2;

Class SecondReducer
    Method SecondReducer(keys, values)
        for each keys[i] in the input keys
            values[i] = sum of values for the same key;
            ComputeAndCut(<keys[i], values[i]>, s);
        endfor
        Output <keys, values>;
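The following is a minimal standalone Java sketch of the second reducer's thresholding step; the names sum, s and total mirror the Appendix, but the class itself is only an illustration and not the actual Hadoop reducer.

import java.util.Arrays;
import java.util.Iterator;

public class SecondReduceSketch {
    // Sum the per-subfile counts of one candidate and keep it only if its
    // global support (sum / total) reaches the threshold s.
    static void reduce(String itemset, Iterator<Integer> counts, double s, int total) {
        int sum = 0;
        while (counts.hasNext()) {
            sum += counts.next();
        }
        if (sum >= s * total) {
            System.out.println(itemset + "\t" + sum);
        }
    }

    public static void main(String[] args) {
        // Toy usage: counts 3 and 2 from two subfiles, 100 baskets, s = 0.02.
        reduce("2,3,4", Arrays.asList(3, 2).iterator(), 0.02, 100);
    }
}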
Test
We have tried many kinds of input data to test our program. We used data with a small number of transactions and small baskets to verify the correctness of the results. If the input has few transactions but very large baskets, the running time is still huge; if the baskets are not very long, the running time can be short. We also tried different values of k and s and found that, for the same s, the running time is relatively longer when k is too large or too small. For the same k, increasing s shortens the running time, and the running time can be very long when s is very small.
So far, our program can finish processing example.dat with parameters k = 30 and s = 0.02 within 3 minutes. This is much better than what we achieved at the beginning, which was longer than one hour.
Discussion
Running time and memory consumption are critical factors affecting the efficiency of the program. At first, our program worked well only when the input file had small baskets.
While checking the performance of our program, we found a fatal flaw: the way we read the file in the Map function of Pass 1. In the Hadoop Mapper class, the default behavior is to feed the mapper one line at a time. However, the Apriori algorithm needs the program to read the entire input file to know the number of baskets. Otherwise, the support threshold we define cannot prune Ck, because every itemset then has a support of 100% within its single-line input. The Apriori algorithm, in this case, does not actually work: it merely enumerates all subsets of the baskets and outputs them to the reducer as candidate frequent itemsets whose actual support must be counted in the next pass. This requires huge memory and far too much time. After searching on the internet, we found that overriding the InputFormat and RecordReader classes solves this problem.
Moreover, our program is very sensitive to the number of subfiles and the value of the threshold. If we produce too many subfiles, each subfile is so small that the support of each itemset within it becomes relatively large. In this case, the support threshold is again nearly useless, and Pass 1 generates a large set of candidate itemsets, which results in a huge demand for time and memory. For the same reason, if s is too small, the number of candidate itemsets is also large. And if k is very small, each subfile is very big and takes a long time to process, since a large number of baskets results in a large C1, and consequently a huge cost in time and memory.
While we have already made some progress in improving the efficiency of our program, computing C1, C2, L1 and L2 still takes a lot of time.
Generating C1 is slow because the program splits every basket to get its items, which requires extensive processing of the file; reading and processing the file in this way can be time-costly. Generating L1 is always slow when C1 is large, because the algorithm has to traverse all the baskets to count the number of appearances of every item in C1. Generating C2 from L1 is also a time-costly step, since if |L1| = n, then |C2| = n(n - 1)/2 = O(n^2). Generating L2 is also slow because of the size of C2. When k >= 3, the process is much faster because the program can effectively prune itemsets in Ck using the monotonicity of frequent itemsets.
Our future work on improving the program consists of the following four aspects.
1. Change the data structure used to store the candidate frequent itemsets.
2. Prune redundant operations and data structures in our Pass 1 algorithm.
3. Combine our current algorithm with other algorithms such as PCY and Multihash to improve efficiency.
4. Find ways to determine a proper k and s for a given file.
Appendix
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TreeMap;
import java.util.TreeSet;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
public class FrequentItemset_MapReduce {
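// s: support threshold; total: total number of baskets in the input file;
// partition: number of subfiles (the parameter k); FirstResult: candidate
// itemsets read back from the Pass 1 output before Pass 2 runs.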
static double s = 0.0;
static int total = 0;
static int partition = 1;
public static final String STRING_SPLIT = ",";
static List<String> FirstResult = new ArrayList<String>();
public static IntWritable one = new IntWritable(1);
public static boolean contain(String[] src, String[] dest) {
for (int i = 0; i < dest.length; i++) {
int j = 0;
for (; j < src.length; j++) {
if (src[j].equals(dest[i])) {
break;
}
}
if (j == src.length) {
return false;// can not find
}
}
return true;
}
public static class CandidateItemsetMapper extends MapReduceBase implements
Mapper<LongWritable, Text, Text, IntWritable> {
@Override
public void map(LongWritable arg0, Text value,
OutputCollector<Text, IntWritable> output, Reporter arg3)
throws IOException {
List<String[]> data = null;
try {
data = loadChessData(value);
} catch (Exception e) {
e.printStackTrace();
}
Map<String, Double> result = compute(data, s, null, null);
for (String key : result.keySet()) {
output.collect(new Text(key), one);
}
}
public Map<String, Double> compute(List<String[]> data,
Double minSupport, Integer maxLoop, String[] containSet) {
if (data == null || data.size() <= 0) {
return null;
}
Map<String, Double> result = new TreeMap<String, Double>();
Map<String, Double> tempresult = new HashMap<String, Double>();
String[] itemSet = getDataUnitSet(data);
int loop = 0;
// loop1
Set<String> keys = combine(tempresult.keySet(), itemSet);
tempresult.clear();
for (String key : keys) {
tempresult.put(key,
computeSupport(data, key.split(STRING_SPLIT)));
}
cut(tempresult, minSupport);
result.putAll(tempresult);
loop++;
String[] strSet = new String[tempresult.size()];
tempresult.keySet().toArray(strSet);
while (true) {
keys = combine(tempresult.keySet(), strSet);
tempresult.clear();
for (String key : keys) {
tempresult.put(key,
computeSupport(data, key.split(STRING_SPLIT)));
}
cut(tempresult, minSupport);
strSet = new String[tempresult.size()];
tempresult.keySet().toArray(strSet);
result.putAll(tempresult);
loop++;
if (tempresult.size() <= 0) {
break;
}
if (maxLoop != null && maxLoop > 0 && loop >= maxLoop) {
break;
}
}
return result;
}
public Double computeSupport(List<String[]> data, String[] subSet) {
Integer value = 0;
for (int i = 0; i < data.size(); i++) {
if (contain(data.get(i), subSet)) {
value++;
}
}
return value * 1.0 / data.size();
}
public String[] getDataUnitSet(List<String[]> data) {
List<String> uniqueKeys = new ArrayList<String>();
for (String[] dat : data) {
for (String da : dat) {
if (!uniqueKeys.contains(da)) {
uniqueKeys.add(da);
}
}
}
// String[] toBeStored = list.toArray(new String[list.size()]);
String[] result = uniqueKeys.toArray(new String[uniqueKeys.size()]);
return result;
}
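// Self-join: when src is empty, return the individual items as C1; otherwise
// combine each (k-1)-itemset key in src with the candidates in target whose
// first k-2 items match, and prune joined keys whose other (k-1)-subsets are
// not in src (the Check step described in the report).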
public Set<String> combine(Set<String> src, String[] target) {
Set<String> dest = new TreeSet<String>();
if (src == null || src.size() <= 0) {
for (String t : target) {
dest.add(t.toString());
}
return dest;
}
for (String s : src) {
for (String t : target) {
String[] itemset1 = s.split(STRING_SPLIT);
String[] itemset2 = t.split(STRING_SPLIT);
int i = 0;
for (i = 0; i < itemset1.length - 1
&& i < itemset2.length - 1; i++) {
int a = Integer.parseInt(itemset1[i]);
int b = Integer.parseInt(itemset2[i]);
if (a != b)
break;
else
continue;
}
int a = Integer.parseInt(itemset1[i]);
int b = Integer.parseInt(itemset2[i]);
if (i == itemset2.length - 1 && a != b) {
String keys = s + STRING_SPLIT + itemset2[i];
String key[] = keys.split(STRING_SPLIT);
String Checkkeys = null;
if (a > b) {
String temp;
temp = key[key.length - 1];
key[key.length - 1] = key[key.length - 2];
key[key.length - 2] = temp;
keys = key[0];
for (int j = 0; j < key.length - 1; j++) {
keys = keys + STRING_SPLIT + key[j + 1];
}
}
if (key.length > 2) {
int k = 0;
for (k = 0; k < key.length - 2; k++) {
int end1 = keys.indexOf(key[k]);
int start2 = keys.indexOf(key[k + 1]);
Checkkeys = keys.substring(0, end1)
+ keys.substring(start2, keys.length());
if (!src.contains(Checkkeys))
break;
else
continue;
}
if (k == key.length - 2)
dest.add(keys);
}
if (Checkkeys == null) {
if (!dest.contains(keys)) {
dest.add(keys);
}
}
}
}
}
return dest;
}
public Map<String, Double> cut(Map<String, Double> tempresult,
Double minSupport) {
for (Object key : tempresult.keySet().toArray()) {
if (minSupport != null && minSupport > 0 && minSupport < 1
&& tempresult.get(key) < minSupport) {
tempresult.remove(key);
}
}
return tempresult;
}
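// Parse one whole subfile (delivered by WholeFileRecordReader) into a list of
// baskets, one String[] of items per line.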
public static List<String[]> loadChessData(Text value) throws Exception {
List<String[]> result = new ArrayList<String[]>();
StringTokenizer baskets = new StringTokenizer(value.toString(),
"n");
while (baskets.hasMoreTokens()) {
String[] items = baskets.nextToken().split(" ");
result.add(items);
}
return result;
}
}
public static class CandidateItemsetReducer extends MapReduceBase implements
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
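// The summed local counts are discarded: Pass 1 only needs each distinct
// candidate itemset once, and actual support is computed in Pass 2.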
output.collect(key, new IntWritable(1));
}
}
public static void preprocessingphase1(String[] args) throws Exception {
String originalfilepath = getLocation(args[0]);
System.out.println(originalfilepath);
if (originalfilepath == null)
return;
List<String> lines = readFile(originalfilepath);
if (lines == null)
return;
total = lines.size();
partition = Integer.parseInt(args[1]);
int m = (int) total / partition;
double m_d = total * 1.0 / partition;
if (m_d > m)
m = m + 1;
mkdir("input_temp");
for (int i = 0; i < partition; i++) {
String newpath = "input_temp/" + i + ".dat";
String input_temp = "";
for (int j = 0; j < m && total - i * m - j > 0; j++) {
input_temp += lines.get(i * m + j) + "\n";
}
createFile(newpath, input_temp.getBytes());
}
}
public static void preprocessingphase2() throws Exception {
List<String> lines = readFile("output_temp/part-00000");
Iterator<String> itr = lines.iterator();
while (itr.hasNext()) {
String basket = (String) itr.next();
String itemset = basket.substring(0, basket.indexOf("\t"));
FirstResult.add(itemset);
}
System.out.println("Pre processing for phase 2 finished.");
}
public static class FrequentItemsetMapper extends MapReduceBase implements
Mapper<LongWritable, Text, Text, IntWritable> {
public void map(LongWritable key, Text value,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
String data = value.toString();
String[] baskets = data.split("\n");
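// For each candidate itemset from Pass 1, count how many baskets in this
// subfile contain every item of the candidate.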
for (int i = 0; i < FirstResult.size(); i++) {
int number = 0;
String[] items = FirstResult.get(i).split(STRING_SPLIT);
for (int j = 0; j < baskets.length; j++) {
int k = 0;
for (k = 0; k < items.length; k++) {
String[] basketsitemset = baskets[j].split(" ");
if (contain(basketsitemset, items))
continue;
else
break;
}
if (k == items.length) {
number = number + 1;
}
}
output.collect(new Text(FirstResult.get(i)), new IntWritable(
number));
}
}
}
public static class FrequentItemsetReducer extends MapReduceBase implements
Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter)
throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
if (sum >= s * total)
output.collect(key, new IntWritable(sum));
}
}
public static List<String> readFile(String filePath) throws IOException {
Path f = new Path(filePath);
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(filePath), conf);
FSDataInputStream dis = fs.open(f);
InputStreamReader isr = new InputStreamReader(dis, "utf-8");
BufferedReader br = new BufferedReader(isr);
List<String> lines = new ArrayList<String>();
String str = "";
while ((str = br.readLine()) != null) {
lines.add(str);
}
br.close();
isr.close();
dis.close();
System.out.println("Original file reading complete.");
return lines;
}
public static String getLocation(String path) throws Exception {
Configuration conf = new Configuration();
FileSystem hdfs = FileSystem.get(conf);
Path listf = new Path(path);
FileStatus stats[] = hdfs.listStatus(listf);
String FilePath = stats[0].getPath().toString();
hdfs.close();
System.out.println("Find input file.");
return FilePath;
}
public static void mkdir(String path) throws IOException {
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path srcPath = new Path(path);
boolean isok = fs.mkdirs(srcPath);
if (isok) {
System.out.println("create dir ok.");
} else {
System.out.println("create dir failure.");
}
fs.close();
}
public static void createFile(String dst, byte[] contents)
throws IOException {
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path dstPath = new Path(dst);
FSDataOutputStream outputStream = fs.create(dstPath);
outputStream.write(contents);
outputStream.close();
fs.close();
System.out.println("file " + dst + " create complete.");
}
public static void phase1(String[] args) throws Exception {
s = Double.parseDouble(args[2]);
JobConf conf = new JobConf(FrequentItemset_MapReduce.class);
conf.setJobName("Find frequent candidate");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(CandidateItemsetMapper.class);
conf.setReducerClass(CandidateItemsetReducer.class);
conf.setInputFormat(WholeFileInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(conf, new Path("input_temp"));
FileOutputFormat.setOutputPath(conf, new Path("output_temp"));
JobClient.runJob(conf);
}
// phase 2
public static void phase2(String[] args) throws Exception {
JobConf conf = new JobConf(FrequentItemset_MapReduce.class);
conf.setJobName("Frequent Itemsets Count");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(FrequentItemsetMapper.class);
conf.setReducerClass(FrequentItemsetReducer.class);
FileInputFormat.setInputPaths(conf, new Path("input_temp"));
FileOutputFormat.setOutputPath(conf, new Path("output"));
JobClient.runJob(conf);
}
public static class WholeFileRecordReader implements
RecordReader<LongWritable, Text> {
private FileSplit fileSplit;
private Configuration conf;
private boolean processed = false;
public WholeFileRecordReader(FileSplit fileSplit, Configuration conf)
throws IOException {
this.fileSplit = fileSplit;
this.conf = conf;
}
@Override
public boolean next(LongWritable key, Text value) throws IOException {
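// Deliver the entire file content as a single record on the first call;
// report no more records afterwards.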
if (!processed) {
byte[] contents = new byte[(int) fileSplit.getLength()];
Path file = fileSplit.getPath();
String fileName = file.getName();
FileSystem fs = file.getFileSystem(conf);
FSDataInputStream in = null;
try {
in = fs.open(file);
IOUtils.readFully(in, contents, 0, contents.length);
value.set(contents, 0, contents.length);
} finally {
IOUtils.closeStream(in);
}
processed = true;
return true;
}
return false;
}
@Override
public LongWritable createKey() {
return new LongWritable();
}
@Override
public Text createValue() {
return new Text();
}
@Override
public long getPos() throws IOException {
return processed ? fileSplit.getLength() : 0;
}
@Override
public float getProgress() throws IOException {
return processed ? 1.0f : 0.0f;
}
@Override
public void close() throws IOException {
// do nothing
}
}
public static class WholeFileInputFormat extends
FileInputFormat<LongWritable, Text> {
@Override
protected boolean isSplitable(FileSystem fs, Path filename) {
return false;
}
@Override
public RecordReader<LongWritable, Text> getRecordReader(
InputSplit split, JobConf job, Reporter reporter)
throws IOException {
return new WholeFileRecordReader((FileSplit) split, job);
}
}
public static void main(String[] args) throws Exception {
if (args.length < 3) {
System.out.println("The number of arguments is less than three.");
return;
}
preprocessingphase1(args);
phase1(args);
preprocessingphase2();
phase2(args);
List<String> lines = readFile("output/part-00000");
Iterator<String> itr = lines.iterator();
File filename = new File("/home/hadoop/Desktop/result.txt");
filename.createNewFile();
try {
BufferedWriter out = new BufferedWriter(new FileWriter(filename));
String firstline = Integer.toString(lines.size()) + "\n";
out.write(firstline);
while (itr.hasNext()) {
String basket = (String) itr.next();
String itemset = basket.substring(0, basket.indexOf("\t"));
String number = basket.substring(basket.indexOf("\t") + 1,
basket.length());
out.write(itemset + "(" + number + ")" + "\n");
}
out.flush();
out.close();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}