> Some of you may know this... I'd like to write a program that will match up a
> description with a specific code. I'm trying to find out the kind of algorithms
> and techniques that I should research.
> For example: I have a field with a description in uppercase, let's say "100 PAGE
> BOOK WITH RED COVER" and "3 1/2 INCH HARD DISK DRIVE UNIT". Now let's imagine
> that the first would match a category with code 12345 and the second a category
> with code 98765.
There's something obscure in this. I didn't get how do you plan to distinguish the categories... what puts the first string in 12345? The word "RED", the word "BOOK", the word "COVER", a combination of the three? Same thing about 98765... "DRIVE" maybe?
If it's just about keywords, it's pretty easy.
> This is very simple in theory (switch/case and a bit of fuzzy logic to handle
> misspellings), but in practice I have a lot of codes to map to and I could have
> any kind of description (in a fixed-width field though). Now of course I don't
> expect there is any way to yield good results at every time, but I'm trying to
> figure out where to spend research time on (what area).
If you can write a context-free grammar for the descriptions then you can feed it to any yacc-like parser generator like bison or jay, and it will generate a parser for you in a moment. Write sensible semantic actions bound to the recognition of high-level nonterminals (which could be "Book" or "Drive" in your case) and you are done.
Of course you may need to preprocess the descriptions a bit to clean misstypings and throw away useless tokens before feeding the strings to the parser, or the grammar is going to be full of conflicts and the parser generator will complain. This can be done while scanning the string, so maybe the lexer will take some time. Just plan things beforehand: a well-written grammar is half the job.
> The goal of this is to parse information on a huge DB to a simplified one, so a
> search frontend only has to display categories. This could be done by hand but
> the number of entries make it extremely long, and I expect this would make the
> processing of this information much much faster (i.e. days => minutes).
> Any ideas?
If you're in such a rush, maybe there are faster ways than writing a LARL(1) parser :P