COMP9313 2018s2 Project 3

Problem Definition:

Given two collections of records R and S, a similarity function sim(., .), and a threshold τ, the set similarity join between R and S finds all record pairs (r, s), with r from R and s from S, such that sim(r, s) >= τ. In this project, sim(., .) is the Jaccard similarity, i.e. sim(r, s) = |r ∩ s| / |r ∪ s|.
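As a minimal sketch of the similarity function, the following Scala snippet computes the Jaccard similarity of two records represented as sets of integer element IDs. The object and method names are hypothetical, and returning 0.0 for two empty sets is an assumption that the project description does not address.

```scala
object JaccardSketch {
  // Jaccard similarity: |r ∩ s| / |r ∪ s|.
  // Assumption: two empty sets are given similarity 0.0 to avoid dividing by zero.
  def jaccard(r: Set[Int], s: Set[Int]): Double = {
    val inter = r.intersect(s).size
    // |r ∪ s| = |r| + |s| - |r ∩ s|, so the union never has to be materialised.
    val union = r.size + s.size - inter
    if (union == 0) 0.0 else inter.toDouble / union
  }
}
```

A pair (r, s) belongs to the join result when `JaccardSketch.jaccard(r, s) >= tau`, where `tau` stands for the threshold τ.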

For the above example with τ = 0.5, the results are (r1, s1) (similarity 0.75), (r2, s2) (similarity 0.5), (r3, s1) (similarity 0.5), and (r3, s2) (similarity 0.5).

Input files:

Each record collection is stored in one text file, and each line has the format “RecordId list<ElementId>”. Two example input files are as below (integers are separated by space):
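One possible way to read a line of this format is sketched below in Scala. It assumes that every token is an integer, that the first token is the RecordId, and that the remaining tokens are the ElementIds; the object name, method name, and the sample line used afterwards are hypothetical.

```scala
object RecordParseSketch {
  // Parses one line of the form "RecordId ElementId ElementId ...".
  // Assumption: the line is non-empty and every token is a valid integer;
  // malformed lines would throw NumberFormatException in this sketch.
  def parseRecord(line: String): (Int, Set[Int]) = {
    val tokens = line.trim.split("\\s+").map(_.toInt)
    // The first integer is the record id, the rest form the record's element set.
    (tokens.head, tokens.tail.toSet)
  }
}
```

For instance, `RecordParseSketch.parseRecord("0 1 4 5 6")` yields the record id 0 paired with the element set {1, 4, 5, 6}.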

Another small test data set (τ=0.1) can be downloaded at: https://webcms3.cse.unsw.edu.au/COMP9313/18s2/resources/21255

Output:

The output file contains the similar pairs together with their similarity scores. Each line is in the format “(RecordId1,RecordId2)\tSimilarity” (RecordId1 is from the first file and RecordId2 is from the second file). Round the similarities to six decimal places (you can use BigDecimal to do this).
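The rounding and line format can be sketched in Scala as follows. This assumes that the separator between the pair and the similarity is a tab character and that half-up rounding is acceptable; the object and method names are hypothetical. Note that `setScale` keeps trailing zeros, so check against the expected output whether that matches what is required.

```scala
object OutputFormatSketch {
  // Builds one output line: "(RecordId1,RecordId2)<TAB>Similarity".
  // Assumptions: tab separator and HALF_UP rounding to six decimal places.
  def formatPair(id1: Int, id2: Int, sim: Double): String = {
    val rounded = BigDecimal(sim).setScale(6, BigDecimal.RoundingMode.HALF_UP)
    s"($id1,$id2)\t$rounded"
  }
}
```

With this sketch, a similarity of 0.75 for the pair (1, 2) is rendered as the pair, a tab, and then 0.750000.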

The pairs are sorted in ascending order by the first record ID and then by the second. Given the example input data, the output file looks like:
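The ordering rule can be illustrated with a small Scala sketch. It assumes that record IDs are integers and should therefore be compared numerically rather than as strings (so that 2 sorts before 10); the names and the sample pairs in the usage note are hypothetical.

```scala
object SortSketch {
  // Sorts result pairs by the first record id, then by the second, ascending.
  // The tuple key (id1, id2) uses the standard lexicographic ordering on Int pairs.
  def sortPairs(pairs: Seq[((Int, Int), Double)]): Seq[((Int, Int), Double)] =
    pairs.sortBy { case ((id1, id2), _) => (id1, id2) }
}
```

Given the pairs (10,1), (2,3), and (2,1), the sketch returns them in the order (2,1), (2,3), (10,1). On a Spark RDD keyed by the ID pair, `sortByKey` would play the same role.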

Code format:

Name the package “comp9313.proj3” and your Scala file “SetSimJoin.scala”. Your program should take four parameters: the input file 1, the input file 2, the output folder, and the similarity threshold τ (double precision).
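A hedged sketch of reading the four parameters is given below. Only the parameter order comes from the description above; the `Config` case class, the object name, and the error messages are hypothetical, and the Spark job itself is deliberately left out.

```scala
// In the submitted file this code would sit under `package comp9313.proj3`.
object SetSimJoinArgs {
  // Hypothetical container for the four command-line parameters.
  final case class Config(inputFile1: String, inputFile2: String,
                          outputFolder: String, threshold: Double)

  // Returns Right(config) for four well-formed arguments, Left(message) otherwise.
  def parse(args: Array[String]): Either[String, Config] =
    args match {
      case Array(in1, in2, out, tau) =>
        scala.util.Try(tau.toDouble).toOption match {
          case Some(t) => Right(Config(in1, in2, out, t))
          case None    => Left(s"threshold is not a number: $tau")
        }
      case _ =>
        Left("usage: SetSimJoin <inputFile1> <inputFile2> <outputFolder> <threshold>")
    }
}
```

The `main` method of `SetSimJoin` could call `SetSimJoinArgs.parse(args)` first and exit with the message when it receives a `Left`.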

Cluster configuration:

Create an S3 bucket with name “comp9313.<YOUR_STUDENTID>” in AWS. Create a folder “project3” in this bucket for holding the input files.

This project aims to let you see the power of distributed computation. Your code should scale well with the number of nodes used in a cluster. You are required to create three clusters in AWS to run the same job:

· Cluster1 – 2 nodes of instance type m3.xlarge;

· Cluster2 – 3 nodes of instance type m3.xlarge;

· Cluster3 – 4 nodes of instance type m3.xlarge.

Select release EMR-5.17.0 when creating each cluster. Unzip and upload the following data set to your S3 bucket, and set τ to 0.85 to run your program:

https://webcms3.cse.unsw.edu.au/COMP9313/18s2/resources/21256

Record the runtime on each cluster and draw a figure where the x-axis is the number of nodes you used, and the y-axis is the time of getting the result.

Store this figure in a file “Runtime.jpg”. Please also take a screenshot of running your program on AWS in each cluster as proof of the runtime. Compress the three screenshots into a zip file “Screenshots.zip”. Briefly describe your optimization techniques in a file “Optimization.pdf”.

Notes:

Create a project locally in Eclipse, test everything on your local computer, and finally run it on AWS EMR.