The original version of this program was retrieved from
http://www.cs.su.oz.au/~scilect/sherlock
and then ported to Plan 9 by me (see ../README).
As a result, this version handles Unicode properly.
See ./diff for a diff(1) output between the original
and Plan 9 versions.
Manual page forthcoming...
The following are notes that were included in the
source of the original version of this program:
* This program takes filenames given on the command line,
* and reads those files into memory, then compares them
* all pairwise to find those which are most similar.
*
* It uses a digital signature generation scheme to randomly
* discard information, thus allowing a better match.
* Essentially it hashes up N adjacent 'words' of input,
* and semi-randomly throws away many of the hashed
* values so that it becomes hard to hide the plagiarised text.
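
The scheme described above can be sketched in portable C as follows.
This is an illustration only, not the code in this directory: the
function names, the FNV-1a hash, and the values of NWORDS and ZEROBITS
are all assumptions. Each window of NWORDS adjacent words is hashed,
and a hash is kept only when its low ZEROBITS bits are zero. The
choice looks random but depends only on the words themselves, so two
files that share a passage keep the same signatures for it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parameters, chosen for illustration only. */
enum { NWORDS = 3, ZEROBITS = 2 };

/* FNV-1a over one word, continuing from a running hash value h. */
static uint32_t hash_word(uint32_t h, const char *w)
{
	for (; *w != '\0'; w++) {
		h ^= (unsigned char)*w;
		h *= 16777619u;
	}
	return h;
}

/*
 * Hash each window of NWORDS adjacent words and keep a hash only when
 * its low ZEROBITS bits are all zero.  Writes at most max signatures
 * to sig and returns the number written.
 */
size_t make_signatures(const char **words, size_t nwords,
	uint32_t *sig, size_t max)
{
	size_t i, j, n = 0;
	uint32_t mask = (1u << ZEROBITS) - 1;

	for (i = 0; i + NWORDS <= nwords && n < max; i++) {
		uint32_t h = 2166136261u;	/* FNV offset basis */
		for (j = 0; j < NWORDS; j++) {
			h = hash_word(h, words[i + j]);
			/* hash a separator so word boundaries matter */
			h = hash_word(h, " ");
		}
		if ((h & mask) == 0)
			sig[n++] = h;
	}
	return n;
}
```

With ZEROBITS set to 2, roughly one window in four would be kept on
average, assuming the hash values are evenly spread.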
--
FUNCTIONS:
char *read_word(Biobuf *bin, int *length, char *ignore, char *punct)
* read_word: read a 'word' from the input, ignoring leading characters
* which are inside the 'ignore' string, and stopping if one of
* the 'ignore' or 'punct' characters is found.
* Uses memory allocation to avoid buffer overflow problems.
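
A minimal sketch of the behaviour described above, written against
stdio rather than Biobuf so that it compiles outside Plan 9. The name
read_word_stdio is hypothetical. Two details are assumptions, not
taken from the notes: a punctuation character that ends a word is
pushed back for the next call, and a punctuation character at the
start of a word is returned as a one-character word.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Returns a malloc'd, NUL-terminated word (caller frees) and stores
 * its length in *length.  Returns NULL at end of input or if memory
 * cannot be allocated.
 */
char *read_word_stdio(FILE *f, int *length,
	const char *ignore, const char *punct)
{
	size_t max = 16, pos = 0;
	char *word, *tmp;
	int ch;

	*length = 0;
	if (ignore == NULL)
		ignore = "";
	if (punct == NULL)
		punct = "";
	word = malloc(max);
	if (word == NULL)
		return NULL;

	while ((ch = getc(f)) != EOF) {
		int is_ignore = ch != '\0' && strchr(ignore, ch) != NULL;
		int is_punct = ch != '\0' && strchr(punct, ch) != NULL;

		if (is_ignore) {
			if (pos == 0)
				continue;	/* skip leading ignorable characters */
			break;			/* ignorable character ends the word */
		}
		if (is_punct && pos > 0) {
			ungetc(ch, f);		/* assumption: leave it for the next call */
			break;
		}
		if (pos + 1 == max) {		/* grow, keeping room for the NUL */
			tmp = realloc(word, max * 2);
			if (tmp == NULL) {
				free(word);
				return NULL;
			}
			word = tmp;
			max *= 2;
		}
		word[pos++] = (char)ch;
		if (is_punct)
			break;			/* assumption: one-character word */
	}
	if (pos == 0) {				/* only ignorable input before EOF */
		free(word);
		return NULL;
	}
	word[pos] = '\0';
	*length = (int)pos;
	return word;
}
```

Growing the buffer on demand is what lets the function accept words
of any length without a fixed-size array.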
--
* Let f1 == filesize(file1) == A+B
* and f2 == filesize(file2) == A+C
* where A is the similar section and B and C are the dissimilar sections
*
* Similarity = 100 * A / (f1 + f2 - A)
*            = 100 * A / (A+B + A+C - A)
*            = 100 * A / (A+B+C)
*
* Thus if A==B==C==n the similarity will be 33% (one third)
* This is desirable since we are finding the ratio of similarities
* as a fraction of (similarities+dissimilarities).
*
* The other way of doing things would be to find the ratio of
* the sum of similarities as a fraction of total file size:
* Similarity = 100 * (A+A) / (A+B + A+C)
* This produces higher percentages and more false matches.
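
The two formulas can be compared with a short sketch. The function
names are hypothetical, and integer division is an assumption made to
match the 33% figure quoted above; f1, f2 and a stand for the
quantities of the same names in the notes.

```c
/* Similarity as a fraction of (similarities + dissimilarities). */
int similarity(int f1, int f2, int a)
{
	int denom = f1 + f2 - a;	/* equals A+B+C */

	if (denom <= 0)
		return 0;		/* nothing to compare */
	return 100 * a / denom;
}

/* The alternative: shared material counted once per file. */
int similarity_alt(int f1, int f2, int a)
{
	int denom = f1 + f2;		/* equals A+B + A+C */

	if (denom <= 0)
		return 0;
	return 100 * (a + a) / denom;
}
```

For A==B==C==10, so that f1 == f2 == 20, similarity(20, 20, 10)
returns 33 while similarity_alt(20, 20, 10) returns 50, which shows
how the second formula reports higher percentages for the same
overlap.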