/usr/web/sources/contrib/akumar/cmd/sherlock/NOTES

The original version of this program was retrieved from
	http://www.cs.su.oz.au/~scilect/sherlock
and was then ported to Plan 9 by me (see ../README).
As a result, this version handles Unicode properly.
See ./diff for a diff(1) output between the original
and Plan 9 versions.
Manual page forthcoming...

The following are notes that were included in the
source of the original version of this program:

 *  This program takes filenames given on the command line,
 *  and reads those files into memory, then compares them
 *  all pairwise to find those which are most similar.
 *
 *  It uses a digital signature generation scheme to randomly
 *  discard information, thus allowing a better match.
 *  Essentially it hashes up N adjacent 'words' of input,
 *  and semi-randomly throws away many of the hashed
 *  values so that it becomes hard to hide the plagiarised text.
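
The scheme described above can be sketched roughly as follows. This is
not the code from sherlock itself: the names hash_word and
make_signature, the FNV-1a word hash, the way the word hashes are
combined, and the rule "keep a hash only when its low zerobits bits are
zero" are all assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* FNV-1a hash of one word (assumption: any reasonable string hash would do). */
static uint32_t hash_word(const char *s)
{
	uint32_t h = 2166136261u;

	while (*s != '\0') {
		h ^= (unsigned char)*s++;
		h *= 16777619u;
	}
	return h;
}

/*
 * Hash every window of ntoken adjacent words and keep only those
 * hashes whose low zerobits bits are all zero.  Kept hashes are
 * stored in sig (at most maxsig of them); the count is returned.
 */
size_t make_signature(const char **words, size_t nwords,
                      size_t ntoken, unsigned zerobits,
                      uint32_t *sig, size_t maxsig)
{
	uint32_t mask = zerobits >= 32 ? 0xffffffffu : (1u << zerobits) - 1u;
	size_t i, j, n = 0;

	if (ntoken == 0 || nwords < ntoken)
		return 0;
	for (i = 0; i + ntoken <= nwords; i++) {
		uint32_t h = 0;

		/* combine the hashes of the ntoken words in this window */
		for (j = 0; j < ntoken; j++)
			h = h * 31u + hash_word(words[i + j]);
		/* discard most hashes; the choice depends only on the hash value */
		if ((h & mask) == 0 && n < maxsig)
			sig[n++] = h;
	}
	return n;
}
```

Because the decision to keep or drop a hash depends only on the hash
value, two files that share a run of words keep the same hashes for
that run. With zerobits set to 4, roughly one window in sixteen would
survive, assuming the hash mixes its low bits well.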

--

FUNCTIONS:
char *read_word(Biobuf *bin, int *length, char *ignore, char *punct)
  *  read_word: read a 'word' from the input, ignoring leading characters
  *  which are inside the 'ignore' string, and stopping if one of
  *  the 'ignore' or 'punct' characters is found.
  *  Uses memory allocation to avoid buffer overflow problems.
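
A minimal sketch of the behaviour described above, written against
standard C stdio rather than bio(2) so that it can be tried outside
Plan 9. The FILE * argument stands in for the Biobuf *, and returning a
punctuation character as a one-character word of its own is an
assumption about how 'punct' is meant to work; the ported source may
differ. Unlike the port, this sketch works on bytes, not runes.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Returns a malloc'd, NUL-terminated word (the caller frees it) and
 * stores its length in *length.  Returns NULL with *length == 0 at
 * end of input or if memory cannot be allocated.
 */
char *read_word(FILE *f, int *length, const char *ignore, const char *punct)
{
	size_t max = 16, pos = 0;
	char *word, *tmp;
	int ch;

	*length = 0;
	if (ignore == NULL)
		ignore = "";
	if (punct == NULL)
		punct = "";
	if ((word = malloc(max)) == NULL)
		return NULL;

	while ((ch = getc(f)) != EOF) {
		if (ch != '\0' && strchr(ignore, ch) != NULL) {
			if (pos == 0)
				continue;	/* leading ignorable character: skip */
			break;			/* ignorable character ends the word */
		}
		if (ch != '\0' && strchr(punct, ch) != NULL) {
			if (pos > 0)
				ungetc(ch, f);	/* leave it for the next call */
			else
				word[pos++] = (char)ch;	/* assumed: punct is its own word */
			break;
		}
		word[pos++] = (char)ch;
		if (pos + 1 >= max) {		/* grow, keeping room for the NUL */
			max *= 2;
			if ((tmp = realloc(word, max)) == NULL) {
				free(word);
				return NULL;
			}
			word = tmp;
		}
	}
	if (pos == 0) {				/* nothing read: end of input */
		free(word);
		return NULL;
	}
	word[pos] = '\0';
	*length = (int)pos;
	return word;
}
```

Under these assumptions, reading "  foo(bar) baz" with ignore set to
" \t\n" and punct set to "()" yields the words foo, (, bar, ) and baz in
turn, then NULL.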

--

 *  Let f1 == filesize(file1) == A+B
 *  and f2 == filesize(file2) == A+C
 *  where A is the similar section and B and C are the dissimilar sections
 *
 *  Similarity = 100 * A / (f1 + f2 - A)
 *             = 100 * A / (A+B + A+C - A)
 *             = 100 * A / (A+B+C)
 *
 *  Thus if A==B==C==n the similarity will be 33% (one third)
 *  This is desirable since we are finding the ratio of similarities
 *  as a fraction of (similarities+dissimilarities).
 *
 *  The other way of doing things would be to find the ratio of
 *  the sum of similarities as a fraction of total file size:
 *  Similarity = 100 * (A+A) / (A+B + A+C)
 *  This produces higher percentages and more false matches.
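
The two formulas above can be compared directly with a small sketch.
The function names similarity and similarity_alt are made up for this
example, and the zero-denominator guard is an addition that the notes
do not mention.

```c
/* Similarity = 100 * A / (f1 + f2 - A), the measure the notes prefer. */
double similarity(double f1, double f2, double a)
{
	double denom = f1 + f2 - a;

	return denom > 0 ? 100.0 * a / denom : 0.0;
}

/* Similarity = 100 * (A+A) / (f1 + f2), the alternative the notes reject. */
double similarity_alt(double f1, double f2, double a)
{
	double denom = f1 + f2;

	return denom > 0 ? 100.0 * (a + a) / denom : 0.0;
}
```

For A == B == C == 100, so that f1 == f2 == 200, similarity(200, 200, 100)
gives about 33.3 while similarity_alt(200, 200, 100) gives 50. Both
measures give 100 for identical files and 0 when nothing is shared.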