> > So the code is made up entirely from the description? Yuck, it would be much
> > better to replace the description with a lookup into another table. Or is
> > this what you're trying to do?
> Kind of. The items come in with a detailed description (e.g. quantity, etc.) and
> the program should map those onto codes that carry a generic description (i.e.
> only what the item is). The codes are pre-existing and known (in other words I
> don't have to generate the code, just match it).
> I would indeed write all the codes in a separate table for easy lookup anyway.
> There are a LOT of codes though. I don't know how many exactly, but at least
> 100,000! Fortunately, they have some kind of hierarchy. For example, every
> sub-category under "1234" belongs to the same category. Thus, in theory, I
> should be able to make a first pass sorting things into generic categories, and
> then subsequently sort them into sub-categories. If I can at least sort these
> into generic categories I'll consider it a success :)
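That two-pass idea above could be sketched pretty simply, assuming the codes are strings whose leading digits name the category (the sample codes, descriptions, and prefix length here are all made up for illustration):

```python
# Sketch of a two-pass hierarchical lookup: first reduce each full code
# to its category prefix, then bucket codes under that prefix.
# The sample table below is invented; real codes would come from the DB.
codes = {
    "123401": "widget, small",
    "123402": "widget, large",
    "567801": "gadget, basic",
}

def category_of(code, prefix_len=4):
    """First pass: reduce a full code to its generic category prefix."""
    return code[:prefix_len]

def group_by_category(codes):
    """Second pass setup: bucket all known codes under their category."""
    buckets = {}
    for code, desc in codes.items():
        buckets.setdefault(category_of(code), []).append((code, desc))
    return buckets

buckets = group_by_category(codes)
# buckets["1234"] now holds both widget codes; "5678" holds the gadget.
```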
> Handling a list of exceptions is a great idea. I'm thinking that I could test
> algorithms quickly by making them in PHP/MySQL and measuring the time it takes
> to process a fixed number of entries, then randomly checking a significant
> number of results to measure accuracy (let's say processing 1000 items and
> checking 100 results to get a success percentage). Then go with whatever's most
> accurate.
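The benchmark you describe (process a fixed batch, then spot-check a random sample) is easy to frame as a harness; `classify` and `check_by_hand` below are placeholders for whatever algorithm is under test and for the human spot-check:

```python
import random
import time

def benchmark(classify, items, sample_size, check_by_hand):
    """Time a candidate classifier over a fixed batch, then spot-check
    a random sample of its results.

    classify      -- candidate algorithm (placeholder; item -> code)
    items         -- fixed batch of entries to process
    check_by_hand -- oracle saying whether an (item, code) pair is right;
                     in practice this would be a human checking results
    """
    start = time.perf_counter()
    results = [(item, classify(item)) for item in items]
    elapsed = time.perf_counter() - start

    sample = random.sample(results, min(sample_size, len(results)))
    correct = sum(check_by_hand(item, code) for item, code in sample)
    return elapsed, correct / len(sample)
```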
> All this is because I saw a friend working with a system like that, in which
> they do things via manual entry (ugh). When I heard that I immediately thought
> to myself "a program could help here", but on second thought I realized that I
> really didn't know what it would take. Then I remembered the fuzzy logic MAME
> uses to guess which game to run, and I thought there might be a correlation :)
> [download a life]
That actually sounds like a really neat research project. As I understand it, you have a set of data containing free-form textual description fields; the majority conform to some sort of standard while others do not, and you want to convert/reduce those into some sort of machine-readable, hierarchically encoded representation instead.
It actually sounds similar to an idea I had in the past: a way to let a computer possibly "learn" English sentence structure by being presented with a large body of sample text, determining the probability of relations to adjacent words and to the other words in the sentence, and deriving some sort of "meaning" from those relationships. Really, it would be some sort of neural-net kind of thing. I'm not highly studied in that field, but I understand it is basically a summation function of probabilities, which is "trained" on a known input set to calculate the internal weights; you then feed it the unknown input set, and it spits out what it thinks they should be, with the opportunity to give it manual feedback on whether the output is correct.
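In its simplest form, that "summation of probabilities trained on a known set" idea is close to a word-count classifier of the naive-Bayes variety. A rough sketch, with all the training data invented:

```python
from collections import Counter, defaultdict
import math

def train(labelled):
    """Count word occurrences per label from a known input set."""
    counts = defaultdict(Counter)
    for text, label in labelled:
        counts[label].update(text.lower().split())
    return counts

def classify(counts, text):
    """Score each label by the summed log-probabilities of the input
    words (with add-one smoothing) and return the best-scoring label."""
    best, best_score = None, float("-inf")
    for label, words in counts.items():
        total = sum(words.values())
        vocab = len(words)
        score = sum(
            math.log((words[w] + 1) / (total + vocab))
            for w in text.lower().split()
        )
        if score > best_score:
            best, best_score = label, score
    return best
```

Feeding it corrections would just mean appending the corrected pair to the labelled set and retraining.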
Basically, I was thinking that if you could do something like this, then you could write some sort of inference pattern-matching system: given a free-form text input field, it would try to match keywords and the relationships between them, and any non-matching (unknown) keywords would be inferred from the collective relation of the known keywords, and from the relation of the unknown keywords to the known ones.
"fuzzy" contextual relationship matching, essentially.
How one would go about implementing such a thing, I'm not exactly sure, but it would be really neat to create a workable system like that.
Of course the less complicated implementation would be, as you mentioned, to simply process the "known" entries and filter out the unknown ones, then have a human do the conversion for those, since our own neural inferencing engine still seems to be orders of magnitude more functional and effective than a machine's. :)
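That simpler split (auto-convert exact matches, queue the rest for a human) might look like this; the lookup table is again invented:

```python
# Hypothetical lookup of known descriptions -> codes.
lookup = {
    "widget, small": "123401",
    "gadget, basic": "567801",
}

def split_entries(entries):
    """Auto-convert entries whose description matches the lookup table
    exactly; everything else goes into a human review queue."""
    converted, for_review = [], []
    for desc in entries:
        code = lookup.get(desc.lower())
        (converted if code else for_review).append((desc, code))
    return converted, for_review
```

The review queue then doubles as the exception list you'd feed back into the matcher.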