Welcome to Emulationworld



Subject: text pattern recognition
Posted by: newsdee
Posted on: 10/06/04 06:37 PM



Some of you may know this... I'd like to write a program that will match up a description with a specific code. I'm trying to find out the kind of algorithms and techniques that I should research.

For example: I have a field with a description in uppercase, let's say "100 PAGE BOOK WITH RED COVER" and "3 1/2 INCH HARD DISK DRIVE UNIT". Now let's imagine that the first would match a category with code 12345 and the second a category with code 98765.

This is very simple in theory (switch/case and a bit of fuzzy logic to handle misspellings), but in practice I have a lot of codes to map to and I could have any kind of description (in a fixed-width field, though). Now of course I don't expect there is any way to yield good results every time, but I'm trying to figure out where to spend my research time (which area).

The goal of this is to parse information from a huge DB into a simplified one, so a search frontend only has to display categories. This could be done by hand, but the number of entries makes it take extremely long, and I expect this would make the processing of this information much, much faster (i.e. days => minutes).

Any ideas?


[download a life]


Subject: Re: text pattern recognition
Posted by: Terry Bogard
Posted on: 10/07/04 05:49 AM



> Some of you may know this... I'd like to write a program that will match up a
> description with a specific code. I'm trying to find out the kind of algorithms
> and techniques that I should research.
>
> For example: I have a field with a description in uppercase, let's say "100 PAGE
> BOOK WITH RED COVER" and "3 1/2 INCH HARD DISK DRIVE UNIT". Now let's imagine
> that the first would match a category with code 12345 and the second a category
> with code 98765.

There's something obscure in this. I didn't get how you plan to distinguish the categories... what puts the first string in 12345? The word "RED", the word "BOOK", the word "COVER", a combination of the three? Same thing about 98765... "DRIVE" maybe?

If it's just about keywords, it's pretty easy.
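If it really is just keywords, a minimal sketch could look like this (the keyword table is invented; only the two example codes come from the post):

```python
# Minimal keyword -> category-code matcher. The KEYWORD_CODES table is
# illustrative; only the codes 12345/98765 come from the original examples.
KEYWORD_CODES = {
    "BOOK": "12345",
    "DRIVE": "98765",
}

def match_category(description):
    """Return the first category code whose keyword appears in the text,
    or None so the entry can be flagged for manual review."""
    for token in description.upper().split():
        if token in KEYWORD_CODES:
            return KEYWORD_CODES[token]
    return None

print(match_category("100 PAGE BOOK WITH RED COVER"))     # 12345
print(match_category("3 1/2 INCH HARD DISK DRIVE UNIT"))  # 98765
```

The None fallback matters: anything the table can't place goes to the hand-check pile instead of being silently miscategorized.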

> This is very simple in theory (switch/case and a bit of fuzzy logic to handle
> misspellings), but in practice I have a lot of codes to map to and I could have
> any kind of description (in a fixed-width field though). Now of course I don't
> expect there is any way to yield good results at every time, but I'm trying to
> figure out where to spend research time on (what area).

If you can write a context-free grammar for the descriptions then you can feed it to any yacc-like parser generator such as bison or jay, and it will generate a parser for you in a moment. Write sensible semantic actions bound to the recognition of high-level nonterminals (which could be "Book" or "Drive" in your case) and you're done.

Of course you may need to preprocess the descriptions a bit to clean up typos and throw away useless tokens before feeding the strings to the parser, or the grammar is going to be full of conflicts and the parser generator will complain. This can be done while scanning the string, so maybe the lexer will take some time. Just plan things beforehand: a well-written grammar is half the job.
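As a rough illustration of that preprocessing step (the stopword list here is invented for the example):

```python
import re

# Words that carry no category information; this list is made up and
# would be tuned against the real data.
STOPWORDS = {"WITH", "THE", "A", "AN", "OF", "UNIT"}

def clean(description):
    """Normalize a raw description before handing it to the parser:
    uppercase, strip punctuation, drop noise tokens."""
    text = re.sub(r"[^A-Z0-9 /]", " ", description.upper())
    return [t for t in text.split() if t not in STOPWORDS]

print(clean("100 page Book, with red cover."))
# ['100', 'PAGE', 'BOOK', 'RED', 'COVER']
```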

> The goal of this is to parse information on a huge DB to a simplified one, so a
> search frontend only has to display categories. This could be done by hand but
> the number of entries make it extremely long, and I expect this would make the
> processing of this information much much faster (i.e. days => minutes).
>
> Any ideas?

If you're in such a rush, maybe there are faster ways than writing a LALR(1) parser :P

OKKAY!


Subject: Re: text pattern recognition
Posted by: smf
Posted on: 10/07/04 03:11 PM



> If you're in such a rush, maybe there are faster ways than writing a LALR(1)
> parser :P

Soundalike and fuzzy-logic matching aren't going to be very reliable; normally you'd just reduce it to a list of unique values and then hand-check the results to remove the misspellings (technically you've only got a problem if there are multiple spellings).
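A sketch of that reduce-to-unique-values pass, with made-up rows:

```python
from collections import Counter

# Invented sample rows; in practice these would come from the DB field.
rows = [
    "100 PAGE BOOK WITH RED COVER",
    "100 PAGE BK WITH RED COVER",      # misspelled variant
    "3 1/2 INCH HARD DISK DRIVE UNIT",
    "100 PAGE BOOK WITH RED COVER",
]

# One pass gives the unique descriptions with counts; a human then only
# has to scan this (much shorter) list for misspellings.
unique = Counter(rows)
for desc, n in unique.most_common():
    print(n, desc)
```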

smf





Subject: thanks both!
Posted by: newsdee
Posted on: 10/07/04 04:07 PM



Thanks for both answers, guys. I'll look into yacc.

> Soundalike and fuzzy-logic matching aren't going to be very reliable; normally
> you'd just reduce it to a list of unique values and then hand-check the results
> to remove the misspellings (technically you've only got a problem if there are
> multiple spellings).

Precisely, and there can be misspellings too. I've seen "BK" for book, in which case I would have to parse things depending on other values.

The code works like this: the first two digits are a general category (let's say "12" is for printed material), and the remaining digits are sub-categories ("34" for hardcover books, "5" for red).
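For illustration, a sketch of splitting such a code into its parts (the field widths are assumed from the "12345" example):

```python
def split_code(code):
    """Split a category code like '12345' into its hierarchical parts.
    Field widths (2 + 2 + rest) are assumed from the example in the post."""
    return {
        "category": code[0:2],       # e.g. "12" = printed material
        "subcategory": code[2:4],    # e.g. "34" = hardcover books
        "attribute": code[4:],       # e.g. "5"  = red
    }

print(split_code("12345"))
# {'category': '12', 'subcategory': '34', 'attribute': '5'}
```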

Don't blame me, I didn't make the old system :)




[download a life]


Subject: Re: thanks both!
Posted by: smf
Posted on: 10/08/04 03:13 AM



So the code is made up entirely from the description? Yuck, it would be much better to replace the description with a lookup into another table. Or is this what you're trying to do?

If it were me, I'd have the correct descriptions in one table, generate the list of descriptions that weren't in the table, and do replacements (e.g. BK -> BOOK) until the list of exceptions was empty.
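A rough sketch of that loop, with invented replacement and known-word tables:

```python
# Exception-driven cleanup: expand known abbreviations, then report
# whatever still isn't recognized. Both tables are made up for the example.
REPLACEMENTS = {"BK": "BOOK", "HD": "HARD DISK"}
KNOWN = {"BOOK", "HARD", "DISK", "DRIVE", "PAGE", "RED", "COVER",
         "100", "3", "1/2", "INCH", "WITH", "UNIT"}

def normalize(description):
    """Apply the abbreviation replacements token by token."""
    return " ".join(REPLACEMENTS.get(t, t) for t in description.split())

def exceptions(descriptions):
    """Tokens still unknown after replacement: the list to hand-check next."""
    bad = set()
    for d in descriptions:
        bad.update(t for t in normalize(d).split() if t not in KNOWN)
    return sorted(bad)

print(exceptions(["100 PAGE BK WITH RED COVER", "3 1/2 INCH HDD UNIT"]))
# ['HDD']  -> add HDD to REPLACEMENTS and re-run until this list is empty
```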

If you don't have a list of codes/descriptions in a table then it's harder; Soundex-style searches can be useful, but it depends on whether you have lots of words that sound alike.
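For reference, a sketch of the classic Soundex coding (this is the standard algorithm, not tied to any particular database's built-in version):

```python
def soundex(word):
    """Classic Soundex: first letter plus three digits coding the rest.
    Similar-sounding consonants share a digit; vowels are dropped."""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    word = word.upper()
    result = word[0]
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "HW":        # H and W don't break a run of equal codes
            prev = code
        if len(result) == 4:
            break
    return result.ljust(4, "0")

print(soundex("BOOK"), soundex("BK"))  # B200 B200 -> the "BK" case collapses
```

Note that the thread's own "BK" for "BOOK" example is exactly the kind of variant Soundex catches; genuinely different words that happen to share a code are where it bites back.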

How many codes are you talking about anyway?

smf





Subject: Re: mapping codes
Posted by: newsdee
Posted on: 10/09/04 11:36 AM



> So the code is made up entirely from the description? Yuck, it would be much
> better to replace the description with a lookup into another table. Or is
> this what you're trying to do?

Kind of. The items come in with a detailed description (e.g. quantity, etc.) and the program should map those into codes that carry a generic description (i.e. what it is only). The codes are pre-existing and known (in other words I don't have to generate the code, just match it).

I would indeed write all the codes in a separate table for easy lookup anyway. There are a LOT of codes though. I don't know how many exactly, but at least 100,000! Fortunately, they have some kind of hierarchy. For example, every sub-category under "1234" belongs to the same category. Thus, in theory, I should be able to make a first pass sorting things into generic categories, and then subsequently sort them into sub-categories. If I can already sort these into generic categories I'll consider it a success :)
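A toy sketch of that two-pass idea (both keyword tables are invented; only the codes echo the thread's examples):

```python
# Pass 1 picks a coarse category; pass 2 only searches that category's
# sub-codes, so 100,000+ codes never have to be scanned at once.
CATEGORY_KEYWORDS = {"12": {"BOOK", "MAGAZINE"}, "98": {"DRIVE", "DISK"}}
SUBCODES = {
    "12": {"12345": {"RED", "COVER"}, "12399": {"BLUE"}},
    "98": {"98765": {"HARD"}},
}

def classify(description):
    tokens = set(description.upper().split())
    for cat, words in CATEGORY_KEYWORDS.items():
        if tokens & words:
            # best-overlapping sub-code within the matched category only
            return max(SUBCODES[cat],
                       key=lambda c: len(tokens & SUBCODES[cat][c]))
    return None  # no category matched: send to the hand-check pile

print(classify("100 PAGE BOOK WITH RED COVER"))   # 12345
print(classify("3 1/2 INCH HARD DISK DRIVE UNIT"))  # 98765
```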

Handling a list of exceptions is a great idea. I'm thinking that I could test algorithms quickly by making them in PHP/mySQL and measure the time it takes to process a fixed number of entries, then randomly checking a significant number of results to measure accuracy (let's say processing 1000 items, checking 100 results to get a % of success). Then go with whatever's most accurate.

All this is because I saw a friend working with a system like that, in which they do things via manual entry (ugh). When I heard that I immediately thought to myself "a program could help here", but on second thought I realized that I really didn't know what it would take. Then I remembered the fuzzy logic MAME uses to guess the game to run, and I thought there might be a correlation :)






[download a life]


Subject: Re: mapping codes
Posted by: Videoman
Posted on: 10/16/04 03:54 PM



> > So the code is made up entirely from the description? Yuck, it would be much
> > better to replace the description with a lookup into another table. Or is
> > this what you're trying to do?
>
> Kind of. The items come in with a detailed description (e.g. quantity, etc.)
> and the program should map those into codes that carry a generic description
> (i.e. what it is only). The codes are pre-existing and known (in other words
> I don't have to generate the code, just match it).
>
> I would indeed write all the codes in a separate table for easy lookup anyway.
> There are a LOT of codes though. I don't know how many exactly, but at least
> 100,000! Fortunately, they have some kind of hierarchy. For example, every
> sub-category under "1234" belongs to the same category. Thus, in theory, I
> should be able to make a first pass sorting things into generic categories,
> and then subsequently sort them into sub-categories. If I can already sort
> these into generic categories I'll consider it a success :)
>
> Handling a list of exceptions is a great idea. I'm thinking that I could test
> algorithms quickly by making them in PHP/mySQL and measure the time it takes
> to process a fixed number of entries, then randomly checking a significant
> number of results to measure accuracy (let's say processing 1000 items,
> checking 100 results to get a % of success). Then go with whatever's most
> accurate.
>
> All this is because I saw a friend working with a system like that, in which
> they do things via manual entry (ugh). When I heard that I immediately thought
> to myself "a program could help here", but on second thought I realized that I
> really didn't know what it would take. Then I remembered the fuzzy logic MAME
> uses to guess the game to run, and I thought there might be a correlation :)

That actually sounds like a really neat research project. As I understand it, you have a set of data containing free-form textual description fields; the majority conform to some sort of standard while others do not, and you want to convert/reduce those into some sort of machine-readable, hierarchically-encoded representation instead.

It actually sounds similar to an idea I had in the past: letting a computer "learn" English sentence structure from a large body of sample text, by determining the probability of relations to adjacent words and to the other words in the sentence, and from those relationships deriving some sort of "meaning". Really, it would be some sort of neural-net kind of thing. I'm not highly studied in that field, but I understand that it is basically some sort of summation function of probabilities, which is "trained" on a known input set to calculate the weights internally; then you feed it the unknown input set and it spits out what it thinks they should be, with the opportunity to give it manual feedback on whether the output is correct or not.
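That train-then-predict shape can be shown in miniature without a real neural net; for example, a naive word-count classifier (all data here is invented):

```python
from collections import defaultdict

def train(examples):
    """examples: list of (description, code) pairs. Returns, per word,
    a count of how often each code co-occurred with it: the 'weights'."""
    weights = defaultdict(lambda: defaultdict(int))
    for text, code in examples:
        for word in text.upper().split():
            weights[word][code] += 1
    return weights

def predict(weights, text):
    """Sum the per-word evidence and pick the highest-scoring code;
    unknown words simply contribute nothing."""
    scores = defaultdict(int)
    for word in text.upper().split():
        for code, n in weights[word].items():
            scores[code] += n
    return max(scores, key=scores.get) if scores else None

w = train([("RED COVER BOOK", "12345"), ("HARD DISK DRIVE", "98765")])
print(predict(w, "BOOK WITH BLUE COVER"))  # 12345
```

A real system would want smoothing and probabilities rather than raw counts, but the training/prediction/feedback loop described above has the same skeleton.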

Basically, I was thinking that if you could do something like this, then you could write some sort of inference pattern-matching system: given a free-form text input field, it would try to match keywords and the relationships between them, and any non-matching (unknown) keywords would be inferred based on the collective relation of the known keywords, and also the relation of the unknown keywords to the known ones.

"fuzzy" contextual relationship matching, essentially.

How one would go about implementing such a thing, I'm not exactly sure, but it would be really neat to create a workable system like that.

Of course the less complicated implementation would be, as you mentioned, to simply process the "known" entries, filter out the unknown ones, and then have a human do the conversion for those, since our neural inferencing engine still seems to be orders of magnitude more functional and effective than a machine's. :)




