> - There are three searches logged for each "1999" and "2000" while I
> only searched once.
As this is the live search, the search action isn't triggered when you
hit the enter key, but on a timer. Under certain circumstances a search
can be triggered more than once, which is also why performance is
crucial for this search action. See below.
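
To illustrate how a timer-driven search can fire several times for the same text, here is a toy model. It is only a sketch: the class, the 0.5 s delay and the idea that any input event re-arms the timer are assumptions made for the example, not a description of the actual code.

```python
class LiveSearch:
    """Toy model of a timer-driven live search (all names hypothetical)."""

    def __init__(self, delay):
        self.delay = delay      # quiet time in seconds before a search fires
        self.pending = None     # (due_time, text) of the scheduled search
        self.log = []           # every search that actually ran

    def input_event(self, now, text):
        # Assumption: any input event re-arms the timer, even if the text
        # in the search box did not change.
        self.tick(now)
        self.pending = (now + self.delay, text)

    def tick(self, now):
        # Run the scheduled search once its timer has expired.
        if self.pending and now >= self.pending[0]:
            self.log.append(self.pending[1])
            self.pending = None


search = LiveSearch(delay=0.5)
search.input_event(0.0, "1999")
search.input_event(0.7, "1999")   # timer already expired: first search ran
search.input_event(1.4, "1999")   # and again
search.tick(2.0)                  # last scheduled search runs
print(search.log)
```

In this model one typed term ends up in the log three times, although Enter was never pressed.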
> Why do some searches get a "w10"?
The full text index stores multiple sets of data for every item. They
are put in different columns, which are later weighted by their
importance. w10 has the highest weight, w1 the lowest.
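
A minimal sketch of the idea, assuming a made-up assignment of fields to columns. Only the rule "w10 = highest weight, w1 = lowest" is taken from the explanation; which field ends up in which column is invented here.

```python
# Hypothetical assignment of fields to weighted columns. Only the rule
# "w10 = highest weight, w1 = lowest" comes from the explanation above.
FIELD_TO_COLUMN = {"title": "w10", "comment": "w1"}


def column_weight(column):
    """Turn a column name like 'w10' into its numeric weight (10)."""
    return int(column[1:])


def index_item(item):
    """Group an item's texts by the weighted column they are stored in."""
    row = {}
    for field, text in item.items():
        column = FIELD_TO_COLUMN.get(field, "w1")  # unknown fields: lowest weight
        row.setdefault(column, []).append(text)
    return row


row = index_item({"title": "99 Luftballons", "comment": "ExactAudioCopy v0.99pb5"})
# Columns ordered from most to least important.
ranked = sorted(row, key=column_weight, reverse=True)
print(ranked)
```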
And here comes the lengthy explanation of what I found investigating
this case. Most likely (depending on your music collection) this really
is a rather special case:
- the keyword is very precise: you know what you expect
- the keyword is very short
- the keyword likely is very popular
Now that popularity thing might be a bit confusing. You probably only
have one single track with this name. Why would it be popular? Because
we're dealing with a full text index, covering not only titles, but lots
of other pieces of information, too. E.g. years, file paths, comments,
even MusicBrainz IDs.
Digging into the 99 case in my collection I found a lot of these:
Comment: ExactAudioCopy v0.99pb5
Yep. Or something like this:
UFID: [ http://musicbrainz.org, ebe13618-bbdd-4ef3-9a91-9981602e528f ]
That -9981602e528f at the end would match, too, as our search term is at
the start of that "word".
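
The matching described above can be reproduced with a small sketch. The tokenizer is an assumption (split on anything that is not a letter or digit); the real index may split words differently.

```python
import re


def tokens(text):
    # Assumed tokenizer: a "word" is any run of letters and digits.
    return re.findall(r"[0-9A-Za-z]+", text)


def prefix_hits(term, text):
    """Return the words in `text` that start with `term`."""
    term = term.lower()
    return [t for t in tokens(text) if t.lower().startswith(term)]


samples = [
    "99 Luftballons",
    "ExactAudioCopy v0.99pb5",
    "http://musicbrainz.org, ebe13618-bbdd-4ef3-9a91-9981602e528f",
]
for sample in samples:
    print(prefix_hits("99", sample))
```

With this tokenizer "99" matches the words "99", "99pb5" and "9981602e528f", one in each sample, which is why the term is far more popular than the track title alone suggests.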
That would explain the popularity of the search term. But why would an
obvious hit not show up, while some obscure, hidden data wins?
Now this is getting complicated. Many factors play a role: optimization
for speed (which might penalize this particular case), and the nature of
full text search, which indexes not only the obvious data, but anything.
And some poor, deliberate choices. And bugs. Wow. Searching for "99"
brought quite a few issues to light :-).
So there's some optimization going on because the search needs to be
fast. One of these optimizations is to try to limit the result set when
we risk dealing with a large number of hits, e.g. with short search
terms or single terms. In this case we're limiting the results to hits
in the highest priority column only (which explains the "w10:99").
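
Roughly, the restriction could look like the sketch below. The "w10:" prefix is the one seen in the log; the function name and the length threshold are hypothetical, and the real criteria for a "risky" search may differ.

```python
MIN_TERM_LENGTH = 3   # hypothetical threshold for what counts as "short"


def build_query(user_input):
    """Restrict risky searches to the highest priority column (w10)."""
    terms = user_input.split()
    # Assumption: a single term, or any very short term, is "risky".
    risky = len(terms) == 1 or any(len(t) < MIN_TERM_LENGTH for t in terms)
    if risky:
        return " ".join(f"w10:{t}" for t in terms)
    return " ".join(terms)


print(build_query("99"))
print(build_query("pink floyd"))
```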
If we know that we are still dealing with a large result set (>500 items
found), the current implementation would only pick the top 500 items.
And that's where I would say there is/was a bug: we pick the top items
out of a non-ordered list... which means that even if "99 Luftballons"
had a high score, it would be cut off whenever it was far down the
"randomly" ordered result list.
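
The difference between cutting before and after sorting, as a small sketch. The hit list and the scores are made up; only the limit of 500 is from the description above.

```python
def top_hits_buggy(hits, limit):
    # Cut first, then sort: anything beyond `limit` is lost before its
    # score is ever looked at.
    return sorted(hits[:limit], key=lambda h: h[1], reverse=True)


def top_hits_fixed(hits, limit):
    # Sort by score first, then keep the best `limit` hits.
    return sorted(hits, key=lambda h: h[1], reverse=True)[:limit]


# 600 low-scoring hits, followed by the one we actually want.
hits = [(f"track {i}", 1) for i in range(600)]
hits.append(("99 Luftballons", 10))

buggy = top_hits_buggy(hits, 500)
fixed = top_hits_fixed(hits, 500)
print(("99 Luftballons", 10) in buggy)
print(fixed[0])
```

In the buggy variant the best hit sits at position 601 of the unordered list and never makes it into the 500 that get ranked.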
When the search is run, it weighs the results based on the
aforementioned columns. If Nena's album had one track called "99
Luftballons", but another album had ten tracks with the EAC version
string in the comment, the latter might outweigh Nena, because the one
track title counts with weight 5 for the album, while the ten comments
count 10x weight 1.
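
As plain arithmetic, assuming an album's score is simply the sum of the weights of its matching tracks (the real ranking function is likely more involved):

```python
TITLE_WEIGHT = 5     # weight of a matching track title on an album
COMMENT_WEIGHT = 1   # weight of a matching comment


def album_score(matching_titles, matching_comments):
    # Assumption: the album score is a plain sum of per-track weights.
    return matching_titles * TITLE_WEIGHT + matching_comments * COMMENT_WEIGHT


nena = album_score(matching_titles=1, matching_comments=0)
other = album_score(matching_titles=0, matching_comments=10)
print(nena, other)
```

One title hit gives 5, ten comment hits give 10, so the album with the EAC comments ranks above Nena.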
This is where a stupid decision kicks in: for whatever reason I decided
it was a good idea to put the MusicBrainz IDs in w10. Sure, an ID is a
unique value for every item. But nothing else should have it, right?
Therefore a search for an ID should always bring up exactly one track,
even if the value is stored in the lowest priority column.
New builds are due out in a bit. Unfortunately my shiny new build system
still isn't installed in a decent place. Therefore I have to upload from
behind this super slow 10Mb connection... So please be patient.
Thanks for an interesting test/edge case! :-)
--
Michael