=============================
seft (search engine for text)
=============================

Usage: seft [OPTIONS] "query terms" text_files

Seft takes a set of query terms and a set of files as arguments and, using
a locality-based similarity heuristic, determines word locations within the
files that are of interest with respect to the query. The user is then
presented with a sequence of windows of text, the first window surrounding
the most relevant location, the second window surrounding the next most
relevant location and so on. Both the number of windows presented and the
size of the window can be specified as parameters to seft. In addition,
the user can specify whether to apply case-folding and/or stemming to the
query terms and the text files.

[OPTIONS]

  -f query_file    A text file containing query terms
  -m max_windows   Specifies the maximum number of windows to display
                   (default = 5)
  -w window_size   Specifies the number of lines within a window
                   (default = 3)
  -x               Turns off highlighting of query term locations
  -n               Suppress output
  -s [0|1|2]       0 = casefolding off, stemming off
                   1 = casefolding on, stemming off
                   2 = casefolding on, stemming on
                   (default = 2)
  -p               Print a formfeed character after every window.
                   Useful when piping output through a pager such as
                   more.
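
As a rough illustration of how -m and -w shape the output, the sketch
below turns a list of ranked line numbers into windows of text. It is
not taken from the seft sources: the helper names (window_bounds,
top_windows), the zero-based line numbering, and the choice to centre
each window on the relevant line and clip it at the file boundaries are
all assumptions made for this example.

```python
def window_bounds(centre, window_size, n_lines):
    """Return (start, end) line indices, end exclusive, for a window of
    window_size lines centred on line `centre`.

    Assumption: a window that would run past the start or end of the
    file is simply clipped, so it may hold fewer than window_size lines.
    """
    before = (window_size - 1) // 2
    start = max(0, centre - before)
    end = min(n_lines, centre - before + window_size)
    return start, end


def top_windows(lines, ranked_centres, max_windows=5, window_size=3):
    """Return the windows of text for the best-ranked line indices.

    `ranked_centres` is assumed to be sorted from most to least
    relevant; producing that ranking is outside this sketch.
    The defaults mirror the documented defaults of -m and -w.
    """
    windows = []
    for centre in ranked_centres[:max_windows]:
        start, end = window_bounds(centre, window_size, len(lines))
        windows.append(lines[start:end])
    return windows
```

With ten lines of text and the ranking [4, 0, 9], calling
top_windows(lines, [4, 0, 9], max_windows=2) yields one three-line
window around line 4 and one clipped two-line window at the top of the
file.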

Examples of usage:
------------------

If the text file Query contains "computer industry", then the following
two seft commands are equivalent:

    seft -f Query ~oldk/News/*
    seft "computer industry" ~oldk/News/*

Either command searches through a user's News folder for articles
relating to "computer" and "industry", and returns windows of text
surrounding the most relevant locations.

Window merging:
---------------

If highly ranked query locations lie in close proximity, seft would
otherwise be likely to display windows with identical contents (when
the highly ranked query terms fall on the same line) or windows that
partially overlap. To avoid this, the current version of seft does not
display a window whose centre line has already been displayed
(anywhere) within a previous window.
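
The rule above can be sketched in a few lines. This is an illustrative
reading of the rule rather than seft's actual code: the function name
select_windows, the zero-based line numbers, the clipping of windows at
file boundaries, and the decision that a suppressed window does not
count towards the maximum are all assumptions.

```python
def select_windows(ranked_centres, n_lines, window_size=3, max_windows=5):
    """Choose which windows to display, best-ranked first.

    A candidate window is skipped when its centre line already lies
    inside a window that was selected earlier. Returns a list of
    (start, end) line ranges with `end` exclusive.
    """
    before = (window_size - 1) // 2
    displayed = set()   # every line shown in some earlier window
    selected = []
    for centre in ranked_centres:
        if len(selected) >= max_windows:
            break
        if centre in displayed:
            # Centre line was already shown: suppress this window.
            continue
        start = max(0, centre - before)
        end = min(n_lines, centre - before + window_size)
        selected.append((start, end))
        displayed.update(range(start, end))
    return selected
```

For the ranking [10, 11, 10, 13, 50] in a 100-line file, the candidates
centred on line 11 and on the repeated line 10 are suppressed because
the first window already covers them. Under this reading, two windows
may still share lines as long as neither centre line falls inside the
other window that was shown first.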

Further work:
-------------

- Piping from stdin:
  The current implementation of seft does not allow the text files
  (to be searched) to be piped into seft, as in:

      cat ~oldk/News/* | seft -q "computer industry"

- Document delimiters:
  Currently, the character "^B" is used as a document delimiter.
  Such a delimiter could be set as a command line argument or set
  in a resource file.
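
To make the delimiter idea concrete, here is a small sketch of splitting
a text stream into documents on a configurable delimiter. It assumes
that "^B" denotes the Control-B character (ASCII 0x02); the function
split_documents and its delimiter parameter are hypothetical and do not
correspond to any existing seft option.

```python
# Assumption: "^B" means Control-B, i.e. the byte 0x02.
DEFAULT_DELIMITER = "\x02"


def split_documents(text, delimiter=DEFAULT_DELIMITER):
    """Split `text` into documents on `delimiter`.

    Pieces that are empty or contain only whitespace are dropped, so a
    trailing delimiter does not produce an empty final document.
    """
    return [doc for doc in text.split(delimiter) if doc.strip()]
```

Passing a different value for delimiter (for example one read from a
command line argument or a resource file) would give the configurable
behaviour suggested above.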

References:
-----------

A detailed discussion of seft can be found in:

@inproceedings{dm00:acsc,
  author    = "O. de~Kretser and A. Moffat",
  title     = "Needles and Haystacks: A Search Engine for Personal
               Information Collections",
  booktitle = "Proc. 23rd Australasian Computer Science Conference",
  year      = 2000,
  note      = "To appear",
}

The locality-based ranking heuristic used by seft is described in:

@inproceedings{dm99:adc,
  author    = "O. de~Kretser and A. Moffat",
  title     = "Effective Document Presentation with a Locality-Based
               Similarity Heuristic",
  booktitle = "Proceedings of the Twenty-Second Annual International
               ACM-SIGIR Conference on Research and Development in
               Information Retrieval",
  pages     = "113--120",
  month     = aug,
  year      = 1999,
  address   = "San Francisco, CA",
  editor    = "M. Hearst and F. Gey and R. Tong",
}

Owen de Kretser, 1/2000