
VMS-List-Archive Search Engine (was: AW: AW: VMs: introducing myself and 2 questions :))



Hi Nick,

>Commendably, GC has already done this (in a 23MB PDF) on his
>http://www.baconbooks.net/ - you can find it on
>http://www.baconbooks.net/Voynich/voynich.htm to be precise.

thanks a lot! :)

>Thank heavens for broadband, eh? :-)

yeah, but i wish i had a symmetric line instead of an asymmetric one when it
comes to uploading 25 megs of an old mailing list archive to a webserver ;)

>Personally, I prefer having individual files to the PDF (perhaps because
>I'm used to using PhotoShop and Debabelizer Pro, etc) - but feel free to
>use whatever works best for you. :-)

i like to have the "big picture" of the book because i'm just at the beginning,
so i use web photo album software where i can see the pages as thumbnails,
two to a screen like in the book.

>of trawling the archives is to find what has been discussed about a
>particular topic or person (say, Edward Kelly, Pietro d'Abano, alchemical
>projection, etc), which is a kind of glorified grep.

yes, for the main search process i agree, but for example a date field
in the database would be very helpful when it comes to ordering emails by date
and so on.
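
a minimal sketch of the date-field idea, assuming the raw "Date:" header of each mail is available as a string. the helper name to_utc_iso and the sample rows are made up for illustration; the point is only that a normalized timestamp sorts correctly where the raw header text would not.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

def to_utc_iso(date_header):
    """Convert a raw Date: header value into a sortable UTC timestamp string."""
    dt = parsedate_to_datetime(date_header)
    if dt.tzinfo is None:
        # Header carried no usable zone information; assume UTC (an assumption).
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

# Hypothetical rows: (mail number, raw Date: header).
rows = [
    (1, "Tue, 03 Dec 1991 10:15:00 +0000"),
    (2, "Mon, 31 Dec 2001 23:59:00 +0100"),
    (3, "Fri, 14 Jun 1996 08:00:00 -0500"),
]

# Sort oldest first on the normalized value, as a date column would.
ordered = sorted(rows, key=lambda row: to_utc_iso(row[1]))
print([number for number, _ in ordered])
```

storing the normalized string in its own column would let the database do the ordering itself.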

>The majority of searches, then, would be content-field-based rather than
>structure-field-based - which is not really where you want to be at with
>mySQL. :-(

i did a quick-and-dirty check to measure the performance.
i cut the email files (everything from '91 to the end of 2001) at the "date:" field
and pushed them into an sql db on a live server (a dedicated
single-processor server running linux) and tested it over a dsl connection.
i ended up with about 7500 separate entries (one for each cut item, i.e. each mail),
but of course there will be a lot of errors in it because of tofu quotes
(top-posted replies with the full quote underneath) etc.
if the cut program works correctly in the future, i think the number of entries
will be less than 7500, but the total size (currently 26 mb) will be the same.
so i think the performance results right now will be roughly comparable to
the future ones. :)
i did 3 tests and measured the running time including output. by running time
i mean only the time the script runs on the server. to this one should add
the time that's needed to download and display the results in the user's
browser.
the script shows the numbers of the emails in the database where the search terms
were found and gives the total count of found items. that will be enough for
building a results page where the user can pick single results for
reading.
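
the cut-and-load step described above could look roughly like this. it is only a sketch: sqlite3 stands in for the sql db on the server, and the table name mails, the function names and the sample archive are all made up. it uses the same naive rule (cut at every line starting with "Date:"), so it shares the same weakness with quoted headers.

```python
import sqlite3

def split_on_date(archive_text):
    """Cut the archive wherever a line starts with 'Date:' (the naive rule)."""
    messages, current = [], []
    for line in archive_text.splitlines(keepends=True):
        if line.lower().startswith("date:") and current:
            messages.append("".join(current))
            current = []
        current.append(line)
    if current:
        messages.append("".join(current))
    return messages

def load(messages):
    """Store one row per cut item in an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE mails (id INTEGER PRIMARY KEY, body TEXT)")
    db.executemany("INSERT INTO mails (body) VALUES (?)",
                   [(message,) for message in messages])
    db.commit()
    return db

# Tiny made-up archive, only to show the mechanics.
sample = (
    "Date: Tue, 03 Dec 1991 10:15:00 +0000\nSubject: first\n\nbacon text\n"
    "Date: Wed, 04 Dec 1991 11:00:00 +0000\nSubject: second\n\nherbal text\n"
)
messages = split_on_date(sample)
db = load(messages)
print(db.execute("SELECT COUNT(*) FROM mails").fetchone()[0])
```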

Test 1: 1 search term ("bacon")
Result: 221 emails, 0.269 sec search time

Test 2: 3 search terms, "and" search ("bacon", "dee", "herbal")
Result: 17 emails, 0.248 sec search time

Test 3: 4 search terms, "or" search ("f1v", "f2r", "herbal", "bacon")
Result: 890 emails, 0.709 sec search time
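
for illustration, an "and"/"or" search of this kind can be sketched as below. this is not the script used for the tests: sqlite3, the table mails, the function search and the four sample mails are assumptions. it matches plain substrings with LIKE, so "dee" would also hit "indeed".

```python
import sqlite3
import time

def search(db, terms, mode="and"):
    """Return (ids, count) of mails containing all ("and") or any ("or") terms."""
    if mode not in ("and", "or"):
        raise ValueError("mode must be 'and' or 'or'")
    if not terms:
        return [], 0
    joiner = " AND " if mode == "and" else " OR "
    # Only fixed fragments go into the SQL text; the terms are bound parameters.
    # Note: % and _ inside a term would act as LIKE wildcards.
    where = joiner.join("body LIKE ?" for _ in terms)
    params = [f"%{term}%" for term in terms]
    rows = db.execute(f"SELECT id FROM mails WHERE {where} ORDER BY id", params)
    ids = [row[0] for row in rows]
    return ids, len(ids)

# Made-up sample data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE mails (id INTEGER PRIMARY KEY, body TEXT)")
db.executemany("INSERT INTO mails (id, body) VALUES (?, ?)", [
    (1, "Roger Bacon and John Dee on the herbal section"),
    (2, "f1v looks like a herbal page"),
    (3, "Bacon again"),
    (4, "nothing relevant here"),
])

start = time.perf_counter()
ids, count = search(db, ["f1v", "f2r", "herbal", "bacon"], mode="or")
elapsed = time.perf_counter() - start
print(ids, count, f"{elapsed:.3f} sec")
```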

of course that's without giving the results a ranking like "email #234 has
bacon in it 3 times, so it must be displayed at the top", but i think the
search is fast enough to be usable.
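
a simple ranking of that kind could be added afterwards, for example by counting how often the search terms occur in each found mail. the sketch below assumes the found mails are already in memory as a dict of mail number to text; the function name rank and the sample texts are made up.

```python
def rank(mails, terms):
    """Order matching mails by total term occurrences, most hits first."""
    scored = []
    for mail_id, body in mails.items():
        text = body.lower()
        score = sum(text.count(term.lower()) for term in terms)
        if score > 0:
            scored.append((score, mail_id))
    # Highest score first; ties are broken by the lower mail number.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(mail_id, score) for score, mail_id in scored]

# Made-up sample mails.
mails = {
    12: "Bacon is mentioned once here",
    234: "bacon, Bacon and more bacon",
    7: "nothing about him",
}
print(rank(mails, ["bacon"]))
```

counting in the script keeps the sql query unchanged, at the cost of reading every found mail once more.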

>There's also a lot of quoting going on, which might throw that out - check
>carefully that you'll get what you expect. Apps like MHonArc work hard to
>divide stuff up sensibly, but not always successfully.

yes, you're right, i found a lot of full quotes and stuff. :/
my idea is to find the correct cut point by searching for 2 or 3 required
header fields that define an email according to rfc 822, and to combine this with
a search for quote marks in front of these fields to identify quoted
header fields. maybe this would work.
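
roughly, that cut-point idea could be sketched like this. the choice of "from" and "date" as the required fields is an assumption, as are the names REQUIRED, HEADER and find_cut_points. a header only counts when its field name starts in column 0, so fields with a quote mark such as ">" in front are skipped.

```python
import re

REQUIRED = {"from", "date"}  # assumed minimum set of required fields
HEADER = re.compile(r"^([A-Za-z][A-Za-z0-9-]*):")  # field name at column 0 only

def find_cut_points(lines):
    """Return indices where an unquoted header block with all REQUIRED fields starts."""
    cuts = []
    i = 0
    while i < len(lines):
        if not HEADER.match(lines[i]):
            i += 1
            continue
        start, seen = i, set()
        # Consume the run of header lines, including folded continuation lines.
        while i < len(lines):
            match = HEADER.match(lines[i])
            if match:
                seen.add(match.group(1).lower())
            elif lines[i][:1] not in (" ", "\t"):
                break
            i += 1
        if REQUIRED <= seen:
            cuts.append(start)
    return cuts

# Made-up archive fragment with one quoted header block in the middle.
lines = [
    "From: a@example.org",
    "Date: Tue, 03 Dec 1991 10:15:00 +0000",
    "Subject: first",
    "",
    "hello",
    "> From: b@example.org",
    "> Date: Mon, 02 Dec 1991 09:00:00 +0000",
    "> quoted text",
    "From: c@example.org",
    "Subject: second",
    " folded continuation",
    "Date: Wed, 04 Dec 1991 11:00:00 +0000",
    "",
    "reply body",
]
print(find_cut_points(lines))
```

full quotes that are pasted without any quote mark in front would still be cut as separate mails, so this only covers the marked case.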

>Absolutely - but the mistakes you make at the very first stage are likely
>to stay with you for a good while, so it's worth trying to do it right. :-o

of course! :)
fortunately the email text files stay with you too, so it is easy to correct
any mistakes made.

>Like most web-crawlers, Google will index dynamic pages fine, as long as
>they have a unique URL (ie, ?msg=0045667 etc) to be referenced by, and a

ah ok, that's what i meant :)

>Feel free to email the list as your thoughts develop, we'll try to move it
>forward between us. :-)

ok, please let me know when the techno-babble starts to bother
the other members of the list :)

kind regards,
sebastian
--
www.sinandsoul.com
