Re: VMS-List-Archive Search Engine (was: AW: AW: VMs: introducing myself and 2 questions :))
To make this work it would be best to tokenise the mail contents, as was
done on home computers when writing text-adventure programs.
Each of the most common words would get a one-byte token residing above 127
in the ASCII table. If wide-character representations of two bytes per char
were used, this could be extended enormously. To retrieve a word it is simply
a case of finding its position in a lookup table using the token as index. To
search the archive, first the search terms are tokenised. This would have an
overhead but would certainly speed up the search and retrieval process. A
PHP script or some other mechanism could be used to feed back the search
results. The arduous task is to tokenise the mail archive in the first
place.
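
A rough sketch of the idea in Python, purely as an illustration: the
function names (build_table, tokenise, detokenise, contains) are made up
for this example, the input is assumed to be 7-bit ASCII, and the table is
assumed to hold at most 128 words so that every token fits in one byte
above 127.

```python
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z]+")

def build_table(texts, size=128):
    """Pick the `size` most common words; list index + 128 is the token byte."""
    if size > 128:
        raise ValueError("only 128 one-byte tokens exist above 127")
    counts = Counter(w for t in texts for w in WORD.findall(t))
    return [w for w, _ in counts.most_common(size)]

def tokenise(text, table):
    """Replace table words with one-byte tokens; copy everything else as ASCII."""
    index = {w: i for i, w in enumerate(table)}
    out = bytearray()
    pos = 0
    for m in WORD.finditer(text):
        # Punctuation and whitespace between words are copied unchanged
        # (non-ASCII characters become '?', an assumption of this sketch).
        out += text[pos:m.start()].encode("ascii", "replace")
        word = m.group()
        if word in index:
            out.append(128 + index[word])
        else:
            out += word.encode("ascii")
        pos = m.end()
    out += text[pos:].encode("ascii", "replace")
    return bytes(out)

def detokenise(data, table):
    """Expand tokens back to words by using the token as a table index."""
    return "".join(table[b - 128] if b >= 128 else chr(b) for b in data)

def contains(data, term, table):
    """Tokenise the search term first, then search the tokenised bytes."""
    return tokenise(term, table) in data
```

Note that in this sketch matching is case-sensitive, and a term that is in
the table only matches the whole word, while any other term is matched as a
plain substring.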
Jeff
Sebastian Unterreitmeier sebastian@xxxxxxxxxxxxxx wrote
>
> yeah, but i wish i had a synchronous line instead of an asynchronous one
> when it comes to uploading 25 megs of an old mailing-list archive to a
> webserver ;)
>
>
> i like to have the "big picture" of the book because i'm at the beginning,
> so i use web photo-album software where i can see the pages as thumbnails,
> two to a page like in the book.
>
>
> yes, for the main search process i agree, but for example the date field
> in the database would be very helpful when it comes to ordering emails by
> date and so on.
>
>
> i did a quick and dirty check to measure the performance.
> i cut the email files (all from 91 to the end of 2001) at the "date:" field
> and pushed them into a sql db on a live server (a dedicated
> single-processor server running linux) and tested it via a dsl connection.
> i ended up with about 7500 separate entries (one for each cut item, i.e.
> each mail), but of course there will be a lot of errors in it because of
> tofu quotes etc.
> if the cut program works correctly in the future, i think the number of
> entries will be less than 7500, but the mass (currently 26mb) will be the
> same. so i think the performance results right now will be roughly
> comparable to future ones. :)
> i did 2 tests and measured the running time including output. by
> running time i mean only the time the script runs on the server. to this
> one should add the time that's needed to download and display the results
> in the user's browser.
> the script shows the email numbers in the database where the result was
> found and gives the total count of found items. that will be enough for
> making a results page where the user can select single results from it
> for reading.
>
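
The quick test described in the quoted message could look roughly like the
following Python sketch. It is only a guess at the shape of such a script,
not the one actually used: an in-memory SQLite database stands in for the
unnamed SQL server, and the table layout and function names (split_on_date,
load, timed_search) are invented for illustration.

```python
import re
import sqlite3
import time

def split_on_date(raw):
    """Cut the raw archive into chunks, each starting at a 'Date:' line.

    Text before the first 'Date:' line is dropped, and a quoted 'Date:'
    at the start of a line causes a wrong cut (the errors mentioned above).
    """
    starts = [m.start() for m in re.finditer(r"^date:", raw, re.I | re.M)]
    return [raw[a:b] for a, b in zip(starts, starts[1:] + [len(raw)])]

def load(chunks):
    """Store one row per chunk, keeping the date in its own column."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE mails (id INTEGER PRIMARY KEY, date TEXT, body TEXT)")
    for chunk in chunks:
        date_line = chunk.partition("\n")[0]
        db.execute("INSERT INTO mails (date, body) VALUES (?, ?)",
                   (date_line[5:].strip(), chunk))
    return db

def timed_search(db, term):
    """Return matching mail numbers, their count and the server-side time."""
    start = time.perf_counter()
    rows = db.execute(
        "SELECT id FROM mails WHERE body LIKE ? ORDER BY id",
        (f"%{term}%",))
    ids = [row[0] for row in rows]
    return ids, len(ids), time.perf_counter() - start
```

The elapsed time covers only the query itself, matching the "running time"
definition above; transfer and display in the browser come on top of it.
The date is stored as the raw header text, so ordering by it would need
parsing into a sortable form first.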