InterMix Engine Design (version 0.0.7)
Version 0.0.7 reduces phrases to two words and introduces the idea of
“range by halves”, which expands on the earlier idea of tagging items as
being in the top such and such percentile. Version 0.1.0 will decide between
MixDex and the “range by halves” method. Version 0.0.7 carries both
methods as alternatives. Also the idea of user keywords makes an
appearance.
Version 0.0.6 drops MixDexBase on the grounds that it is needed only
to locate MixDex records for a given item, and that can be done more efficiently
another way. Also we need to be clear that we only add MixDex records
for a particular SortBy on an item if that item has a value for that SortBy.
This means, for instance, that if there is no interest rating for an item,
then there are no Interest SortBy MixDex records for that item. We
also introduce the distinction between "content" and "structural" keywords.
Version 0.0.5 fills in the "perspective" idea, and adds MixDexBase as
the secondary index on MixItem on which MixDex is built. So now we
have MixDexBase as a secondary on Item and MixDex as a secondary on MixDexBase!
Maybe. Please comment whether the new design seems good or problematical.
----------------
MixKeyword records will have a fourth, fifth and sixth field called
KeywordMetaType1, KeywordMetaType2 and KeywordType. KeywordMetaType1
will be a one byte field to indicate that the keyword is derived
from the content of the item, or is structural to InterMix. For example,
KeywordMetaType1 'i' indicates an 'i'tem content derived keyword.
Structural KeywordMetaType1s will be 'h'ub, ''c'ategory, 'p'articipant,
'l'ink. No doubt, other structural types will be needed. See
http://blog.voiceofhumanity.net/newslog2.php/_v252/__show_article/_a000252-000018.htm.
----------------
Understanding "perspectives"
The core concept underlying InterMix is that groups of users rate items.
Of course this happens by individual group members rating the items and
then averaging or accumulating for the group. Nevertheless, the idea
is that the group ratings for items will be very useful in establishing
group priorities and even a group "consciousness" and a group "voice".
See http://blog.voiceofhumanity.net/newslog2.php/_v252/__show_article/_a000252-000018.htm.
InterMix "perspectives" are the groups whose collective ratings are
being tracked by InterMix. An individual may be a member of more
than one perspective. For instance one may live in Pune and be a
citizen of India and be a speaker of English. If these three perspectives
are defined to an InterMix "hub", then when such a person rates an item,
the average rating for the item must be recalculated for all three perspectives
and if it changes for one of the perspectives, then item indexing must
be redone for that perspective.
------------
InterMix data will be kept in five Berkeley DB tables using the Berkeley
DB Transactional Data Store. See http://pybsddb.sourceforge.net/reftoc.html
and
http://pybsddb.sourceforge.net/ref/intro/products.html. Very
possibly, we will end up handling both the item content keywords and the
ratings updates outside the transaction system. In this case, only
the structural keywords will be maintained for an item inside the transaction
system.
1) MixItem table - items are messages, users, groups and so forth --
keyed by a 32 bit sequentially assigned numerical ItemID with an additional
8 bits of flags to distinguish up to 256 types of item by key for a 5 byte
key and stored as a btree table. All MixItem records will be wrapped
in xml in a format inevitably called the MiXml format. The item,
as presented from external sources will be kept in one MiXml section.
Additional sections of interest at this point are the ItemKey, ItemUuid,
ItemStructuralKeywords, ItemContentKeywords. Structural Keywords
and Content Keywords will be kept in separate sections as KeywordIDs
pointing into the MixKeyword table. The idea behind this distinction
is that when another hub requests an item, the structural keywords will
be translated back into their human readable phrase form and sent with
the item, while the content keywords will not -- content keywords will
be rebuilt on the foreign hub.
2) MixKeyword btree table - links strings of any length with much shorter
KeywordIDs. The basic idea is to allow long strings to be indexed
efficiently in the MixDex table. For instance, a long http address
could be reduced to just 10 bytes, say. Importantly, the MixKeyword
table will also keep a count of how many items in MixItem are keyed by
the referenced string. This count will also be used to increase lookup
efficiency. The KeywordIDs will be sequentially assigned. The
primary key will be the string. There will also be an automatically
maintained secondary index to allow lookup of the string from the KeywordID.
MixKeyword records will have a fourth, fifth and sixth field called
KeywordMetaType1, KeywordMetaType2 and KeywordType. KeywordMetaType1
will be a one byte field to indicate that the keyword is derived
from the content of the item, or is structural to InterMix. For example,
KeywordMetaType1 'i' indicates an 'i'tem content derived keyword.
Structural KeywordMetaType1s will be 'h'ub, ''c'ategory, 'p'articipant,
'l'ink. No doubt, other structural types will be needed. See
http://blog.voiceofhumanity.net/newslog2.php/_v252/__show_article/_a000252-000018.htm.
3) MixRef table - relations between pairs of items are kept here, such
as ratings (user/item), or parent-child relationship (item/reply) and so
forth -- keyed by xref-type/item1/item2 and indexed both ways, so for a
particular xref-type, such as "rating", for a particular item1, we easily
find all the item2's, and the other way about. For ratings, we can
easily find the ratings for all items by a particular user, and the other
way about, for a particular item we can easily lookup all ratings for that
item.
4) MixDex is an InterMix maintained secondary index on MixItem.
The purpose of MixDex is to enable fast selections, sorted by ratings values,
on very large collections of items.
5) MixSys table is an easily extendable grabbag for miscellaneous purposes.
It is keyed by 3 byte printable keytype and 10 byte printable keyvalue.
The data is a string of any length.
--------------
MixItem Details
The MixItem key is not going to be printable. We will need a function
to convert the numerical portion of the key into a printable number followed
by 8 Y/N flags so: 0000020181YYNNNYNY.
InterMix will be written so Big vs Little Endian issues never come up.
The key for the first five bytes is a number, but we will convert the number
into a string internally, and then reconvert the string back to number
byte by byte as needed.
The trick is to shift the number in Python to get each byte separately
and then concatenate all the bytes in little endian format so items that
were added near to each other in time will have keys that are near to each
other. Btree is more efficient if "locality of reference" is leveraged.
Besides, we want the keys to be in the same order that the items were added,
so we can use the keys as a standin for a date sort.
-----------
MixDex will be a manually maintained secondary index on MixItem.
See http://pybsddb.sourceforge.net/ref/am/second.html and http://pybsddb.sourceforge.net/api_c/db_associate.html.
By controlling the MixDex table ourselves through InterMix, we can be
more efficient on retrieval by defining the key to be the entire record
and the data to be null. Then when we ask for the next MixDex record,
bdb won't have to read the data -- looking up the key will be sufficient.
MixDex records have five columns: Perspective, SortBy, KeywordID, SortValue,
ItemID.
MixDex
Perspective
SortByType (interest, approval, etc)
KeywordID
SortByValue
ItemID
MixDex needs to be considered in two ways. First, how do we maintain
it when a MixItem changes or when participants rate an item thus changing
SortBy values for the item? Secondly, how do we use it for rapid
lookup of items in sorted order with selection criteria?
---------------
Maintaining MixDex from MixItem
MixDex records for an item come from four sources.
1) Some MixDex records derive from InterMix "structural" keywords that
define the use and place of the item for InterMix handling. For instance,
it will be important that InterMix keep track of other InterMix "hubs"
that it discovers on the internet. The information about these hubs
will be kept as items that have a 'h'ub structural keyword applied to it.
All these structural keywords will be kept in the item itself in an xml
section for that purpose. They will be kept as KeywordIDs pointing
into the MixKeyword table.
2) Some MixDex records come from "content" keywords derived from the
content of the item. For instance, if the word "water" is found in
the item, the KeywordID from the MixKeyword record for MetaType1 'i' for
"water" will be kept in the item in an xml section for that purpose.
3) From special content defined SortBy keywords. The MiXml format,
which will be the subject of another document, will allow multiple custom
SortBys. For instance, a movie database should be able to SortBy
year. The information to create these MixDex records is just the
new SortBy value for the item, which then will cause the creation of one
MixDex record for every structural and content keyword for the item.
It is worth noticing that these special SortBys do not have multiple perspectives,
which limits the number of MixDex records required considerably.
4) From "user keywords". Each InterMix participant can keyword
themselves. "female" or "renter" or "love music". Also participants
can "join" groups, thus keywording themselves as members of that group.
User keywords are applied to every item that the user writes or rates.
Whenever a rating average changes for an item, of if the contents of
the item change or the structural keywords, then the Mixdex records need
to be created/deleted/changed as needed in the most efficient way possible.
--------------
Using MixDex for sorted lookup with selection
To understand the use of MixDex, we simplify MixDex action and say
there is only one rating value that we want to sort items by, say, sum
of interest ratings -- call this "Interest". And say we just want
to select items by keyword, allowing just one keyword in our query.
For example, we could ask to see a list of items that have the word "water"
in them, sorted by Interest.
In this simple case, the MixDex table would have just three columns:
KeywordID, Interest, ItemID. ItemID is the key of the MixItem table.
Interest is the sum of interest ratings for the Item pointed to by ItemID.
KeywordID is the 12 byte ID that we have reduced all keywords to via the
MixKeyword table. For every different word in the Item, there will
be one record in the MixDex table.
For lookup queries, MixDex will be keyed by KeywordID and Interest.
For any lookup query, we first go to the MixKeyword table to find the KeywordID
for the query word. Then we read MixDex in Interest order for that
KeywordID. Each record points to an item record for an item that
has the query word in it.
To handle AND logic, we work as many "logical nodes" of lookup as there
are ANDed query elements. For instance, if the query is for "water"
AND "ice" AND "floe", we will have three nodes of lookup being worked together.
These logical nodes will be implemented as program objects.
First, we get the KeywordIDs for each of the three lookup words and
the MixKeyword.Count for each KeywordID. The MixKeyword.Count field
tells us how many items in MixItem are keyed by this particular KeywordID.
Then we sort the nodes by MixKeyword.Count. Wherever possible, we
will consult first the node that has the least MixKeyword.Count, since
there will be less hits for this KeywordID. To understand, consider
that many more items will have the word "water" than will have the word
"floe". So it will be less work to check first for the next item
with the word "floe" and then see ifInterMix Engine Design (version 0.0.6)
Version 0.0.6 drops MixDexBase on the grounds that it is needed only
to locate MixDex records for a given item, and that can be done more efficiently
another way. Also we need to be clear that we only add MixDex records
for a particular SortBy on an item if that item has a value for that SortBy.
This means, for instance, that if there is no interest rating for an item,
then there are no Interest SortBy MixDex records for that item. We
also introduce the distinction between "content" and "structural" keywords.
Version 0.0.5 fills in the "perspective" idea, and adds MixDexBase
as the secondary index on MixItem on which MixDex is built. So now
we have MixDexBase as a secondary on Item and MixDex as a secondary on
MixDexBase! Maybe. Please comment whether the new design seems
good or problematical.
----------------
MixKeyword records will have a fourth, fifth and sixth field called
KeywordMetaType1, KeywordMetaType2 and KeywordType. KeywordMetaType1
will be a one byte field to indicate that the keyword is derived
from the content of the item, or is structural to InterMix. For example,
KeywordMetaType1 'i' indicates an 'i'tem content derived keyword.
Structural KeywordMetaType1s will be 'h'ub, ''c'ategory, 'p'articipant,
'l'ink. No doubt, other structural types will be needed. See
http://blog.voiceofhumanity.net/newslog2.php/_v252/__show_article/_a000252-000018.htm.
----------------
Understanding "perspectives"
The core concept underlying InterMix is that groups of users rate items.
Of course this happens by individual group members rating the items and
then averaging or accumulating for the group. Nevertheless, the idea
is that the group ratings for items will be very useful in establishing
group priorities and even a group "consciousness" and a group "voice".
See http://blog.voiceofhumanity.net/newslog2.php/_v252/__show_article/_a000252-000018.htm.
InterMix "perspectives" are the groups whose collective ratings are
being tracked by InterMix. An individual may be a member of more
than one perspective. For instance one may live in Pune and be a
citizen of India and be a speaker of English. If these three perspectives
are defined to an InterMix "hub", then when such a person rates an item,
the average rating for the item must be recalculated for all three perspectives
and if it changes for one of the perspectives, then item indexing must
be redone for that perspective.
------------
InterMix data will be kept in five Berkeley DB tables using the Berkeley
DB Transactional Data Store. See http://pybsddb.sourceforge.net/reftoc.html
and
http://pybsddb.sourceforge.net/ref/intro/products.html. Very
possibly, we will end up handling both the item content keywords and the
ratings updates outside the transaction system. In this case, only
the structural keywords will be maintained for an item inside the transaction
system.
1) MixItem table - items are messages, users, groups and so forth --
keyed by a 32 bit sequentially assigned numerical ItemID with an additional
8 bits of flags to distinguish up to 256 types of item by key for a 5 byte
key and stored as a btree table. All MixItem records will be wrapped
in xml in a format inevitably called the MiXml format. The item,
as presented from external sources will be kept in one MiXml section.
Additional sections of interest at this point are the ItemKey, ItemUuid,
ItemStructuralKeywords, ItemContentKeywords. Structural Keywords
and Content Keywords will be kept in separate sections as KeywordIDs
pointing into the MixKeyword table. The idea behind this distinction
is that when another hub requests an item, the structural keywords will
be translated back into their human readable phrase form and sent with
the item, while the content keywords will not -- content keywords will
be rebuilt on the foreign hub.
2) MixKeyword btree table - links strings of any length with much shorter
KeywordIDs. The basic idea is to allow long strings to be indexed
efficiently in the MixDex table. For instance, a long http address
could be reduced to just 10 bytes, say. Importantly, the MixKeyword
table will also keep a count of how many items in MixItem are keyed by
the referenced string. This count will also be used to increase lookup
efficiency. The KeywordIDs will be sequentially assigned. The
primary key will be the string. There will also be an automatically
maintained secondary index to allow lookup of the string from the KeywordID.
MixKeyword records will have a fourth, fifth and sixth field called
KeywordMetaType1, KeywordMetaType2 and KeywordType. KeywordMetaType1
will be a one byte field to indicate that the keyword is derived
from the content of the item, or is structural to InterMix. For example,
KeywordMetaType1 'i' indicates an 'i'tem content derived keyword.
Structural KeywordMetaType1s will be 'h'ub, ''c'ategory, 'p'articipant,
'l'ink. No doubt, other structural types will be needed. See
http://blog.voiceofhumanity.net/newslog2.php/_v252/__show_article/_a000252-000018.htm.
3) MixRef table - relations between pairs of items are kept here, such
as ratings (user/item), or parent-child relationship (item/reply) and so
forth -- keyed by xref-type/item1/item2 and indexed both ways, so for a
particular xref-type, such as "rating", for a particular item1, we easily
find all the item2's, and the other way about. For ratings, we can
easily find the ratings for all items by a particular user, and the other
way about, for a particular item we can easily lookup all ratings for that
item.
4) MixDex is an InterMix maintained secondary index on MixItem.
The purpose of MixDex is to enable fast selections, sorted by ratings values,
on very large collections of items.
5) MixSys table is an easily extendable grabbag for miscellaneous purposes.
It is keyed by 3 byte printable keytype and 10 byte printable keyvalue.
The data is a string of any length.
--------------
MixItem Details
The MixItem key is not going to be printable. We will need a
function to convert the numerical portion of the key into a printable number
followed by 8 Y/N flags so: 0000020181YYNNNYNY.
InterMix will be written so Big vs Little Endian issues never come
up. The key for the first five bytes is a number, but we will convert
the number into a string internally, and then reconvert the string back
to number byte by byte as needed.
The trick is to shift the number in Python to get each byte separately
and then concatenate all the bytes in little endian format so items that
were added near to each other in time will have keys that are near to each
other. Btree is more efficient if "locality of reference" is leveraged.
Besides, we want the keys to be in the same order that the items were added,
so we can use the keys as a standin for a date sort.
-----------
MixDex will be a manually maintained secondary index on MixItem.
See http://pybsddb.sourceforge.net/ref/am/second.html and http://pybsddb.sourceforge.net/api_c/db_associate.html.
By controlling the MixDex table ourselves through InterMix, we can
be more efficient on retrieval by defining the key to be the entire record
and the data to be null. Then when we ask for the next MixDex record,
bdb won't have to read the data -- looking up the key will be sufficient.
MixDex records have five columns: Perspective, SortBy, KeywordID, SortValue,
ItemID.
MixDex
Perspective
SortByType (interest, approval, etc)
KeywordID
SortByValue
ItemID
MixDex needs to be considered in two ways. First, how do we maintain
it when a MixItem changes or when participants rate an item thus changing
SortBy values for the item? Secondly, how do we use it for rapid
lookup of items in sorted order with selection criteria?
---------------
Maintaining MixDex from MixItem
MixDex records for an item come from three sources.
1) Some MixDex records derive from InterMix "structural" keywords that
define the use and place of the item for InterMix handling. For instance,
it will be important that InterMix keep track of other InterMix "hubs"
that it discovers on the internet. The information about these hubs
will be kept as items that have a 'h'ub structural keyword applied to it.
All these structural keywords will be kept in the item itself in an xml
section for that purpose. They will be kept as KeywordIDs pointing
into the MixKeyword table.
2) Some MixDex records come from "content" keywords derived from the
content of the item. For instance, if the word "water" is found in
the item, the KeywordID from the MixKeyword record for MetaType1 'i' for
"water" will be kept in the item in an xml section for that purpose.
3) From special content defined SortBy keywords. The MiXml format,
which will be the subject of another document, will allow multiple custom
SortBys. For instance, a movie database should be able to SortBy
year. The information to create these MixDex records is just the
new SortBy value for the item, which then will cause the creation of one
MixDex record for every structural and content keyword for the item.
It is worth noticing that these special SortBys do not have multiple perspectives,
which limits the number of MixDex records required considerably.
Whenever a rating average changes for an item, of if the contents of
the item change or the structural keywords, then the Mixdex records need
to be created/deleted/changed as needed in the most efficient way possible.
--------------
Using MixDex for sorted lookup with selection
To understand the use of MixDex, we simplify MixDex action and say there
is only one rating value that we want to sort items by, say, sum of interest
ratings -- call this "Interest". And say we just want to select items
by keyword, allowing just one keyword in our query. For example,
we could ask to see a list of items that have the word "water" in them,
sorted by Interest.
In this simple case, the MixDex table would have just three columns:
KeywordID, Interest, ItemID. ItemID is the key of the MixItem table.
Interest is the sum of interest ratings for the Item pointed to by ItemID.
KeywordID is the 12 byte ID that we have reduced all keywords to via the
MixKeyword table. For every different word in the Item, there will
be one record in the MixDex table.
For lookup queries, MixDex will be keyed by KeywordID and Interest.
For any lookup query, we first go to the MixKeyword table to find the KeywordID
for the query word. Then we read MixDex in Interest order for that
KeywordID. Each record points to an item record for an item that
has the query word in it.
To handle AND logic, we work as many "logical nodes" of lookup as there
are ANDed query elements. For instance, if the query is for "water"
AND "ice" AND "floe", we will have three nodes of lookup being worked together.
These logical nodes will be implemented as program objects.
First, we get the KeywordIDs for each of the three lookup words and
the MixKeyword.Count for each KeywordID. The MixKeyword.Count field
tells us how many items in MixItem are keyed by this particular KeywordID.
Then we sort the nodes by MixKeyword.Count. Wherever possible, we
will consult first the node that has the least MixKeyword.Count, since
there will be less hits for this KeywordID. To understand, consider
that many more items will have the word "water" than will have the word
"floe". So it will be less work to check first for the next item
with the word "floe" and then see if it also has the word "water", than
if we look first at items with the word "water" and check each one if it
has the word "floe". We assume for the example that more items have
"water" than have "ice" than have "floe".
So first we use the KeywordID_Interest index on MixDex to find the
MixDex.MixItemID that has the word "floe" with the highest Interest.
Then we check for the MixItemID that has the word "ice" with the highest
Interest. There are three cases: the Interest field for the "floe"
node is less than, equal to or greater than the Interest field for the
"ice" node.
If the two Interest fields are not equal, then we don't have to bother
actually comparing the MixItemID's, because we know the Interest fields
would have to be the same for the same MixItemID, so we already know the
MixItemID's are not the same. Therefore we can discard all MixItemID's
for the node with the higher Interest, down to the Interest of the other
node. If the "floe" node shows an Interest of 12292 and the "ice"
node shows an Interest of 223299, we can restart the "ice" node index at
12292 and look for the next record with an Interest of 12292 or less.
If we get a hit at 12292, then we check the "water" node, else we restart
the "floe" node at the new Interest of the "ice" node and so forth.
To handle OR logic, we sort the nodes so we look at the node with the
largest instead of the smallest MixKeyword.Count first.
Node:
NodeType - Simple or Compound
direction (sort high to low or low to high)
sortby - Interest, Approval, etc
perspective
KeywordID - only simple nodes have KeywordID value
list of SubNodes - only Compound Nodes have SubNodes
LogicalFunction (AND, OR, NOT) - only Compound Nodes have
LogicalFunction
Count - for simple nodes, this is the count from the MixKeyword
table
for compound nodes, the count is derived from the SubNode Counts
CurrentSortbyValue
CurrentMixItemID
function SetCursor() - uses CurrentSortbyValue
function GetNext() - returns next SortbyValue and MixItemID
Some Notes:
Compound nodes with NOT logic have only a single sub-node.
The count for a compound AND node is the smallest count of any of the
subnodes.
The count for a compound NOT node is calculated as TotalNumberOfItems
minus the count for the subnode.
The count for a compound OR node is calculated over the list of subnodes
using the ResultCount from the previous subnode to calculate the new ResultCount
for the next subnode. Beginning with a current ResultCount of zero,
the next ResultCount = current ResultCount + (1 - (current ResultCount
/ TotalNumberOfItems)) * nextSubNodeCount.
For example if there are 1000 items altogether and we are calculating
the count for "water" OR "ice" where "water" has a count of 100 and "ice"
has a count of 10, then
a) 100 = next ResultCount = 0 + ((1 - (0 / 1000)) * 100)
b) 109 = next ResultCount = 100 + ((1 - (100 / 1000)) * 10)
This may seem like a lot of trouble for no effect -- except it does
make a difference if we are dealing with a subnode that implements NOT
logic where the count can be very large.
For example if in the same database as above, we search for (NOT "water")
OR (NOT "ice"), then the count for (NOT "water") is 900 and for (NOT "ice")
is 990, so
a) 900 = next ResultCount = 0 + ((1 - 0 / 1000)) * 900)
b) 999 = next ResultCount = 900 + ((1 - (900 / 1000)) * 990)
---------------------------------------------
The difficulty with the MixDex design is that adding a new item or
updating a rating may be too slow because there are so many MixDex records
to be added or updated.
To understand the difficulty, imagine a moderately sized item - say
a one thousand word essay. Assuming we will index phrases up to five
words in length, that gives us maybe 4*975 for the phrases + 600 basic
words = 4500 content KeywordIDs. (There will also be perhaps 10 or
20 structural keywords, but these are so few, we can ignore them.)
Now for every SortBy we will need one MixDex record for each KeywordID.
Worse, for the rating SortBys we will need an additional set of MixDex
records covering all the KeywordIDs. We will want always to sort
by 1) date and 2) thread/date, and 3) on average one special SortBy so
that is three sets of MixDex records, or 13,500 of them. Then for
each rating SortBy, there will be one set for every perspective.
So that is Interest, Approval, Controversy, Value, and Usefulness - five
rating type SortBys for maybe 8 perspectives gives 193,500 MixDex records!
Perhaps 8 megs of complex indexing for a single modest item entry.
To mitigate this design disaster, we begin by removing a fairly extensive
list of common generic words, such as "is" and "very". This only
helps with the basic word list, not with the phrases, but still, that brings
us down to maybe 4400 keywordIDs per SortBy/Perspective, or 189,200 MixDex
records.
We can do even better by keeping two word phrases only (instead of 2,3,4
and 5 word phrases) and stopping phrases at periods and semi-colons.
We handle requests for longer phrases by building up the longer phrase
from two-word components. Thus a request for “the lazy brown fox”
is handled as a request for “the lazy” AND “lazy brown” AND “brown fox”.
This method will give a few false positives but not so many as to make
a real difference. Also, on phrase indexing, we do not need to look past
periods or semi-colons. On our 1000 word essay, this will bring us
down to maybe 1450 keywords from 5000, and our total down to 62,350 from
215,000 MixDex records.
Then we realize that "Usefulness" could be handled as "Approval" - so
now we have only 4 rating SortBys, which brings us down to 46,400 records.
Moreover, we could say that we keep MixDex records only for the 20 percent
of messages that have the most number of ratings by Perspective.
So we are down to 13,630 MixDex records. Still too high, so maybe
only the 10 percent of messages with the most ratings – brings us down
to 8,990. This is on the assumption of 8 perspectives.
Finally, only the structural keyword MixDex records are written and
deleted as part of the transaction that adds or updates the item.
The adding/updating of item content based MixDex records is queued for
a work thread and handled outside of the transaction. The down side
to this is that if the system crashes, item content MixDex records will
sometimes be out of sync with the item data. We handle this by backing
up the queue with a system file record for each item that is queued, and
deleting that system file record only when the work thread has completed
for that item and flushed the cache to disk. Every time InterMix
comes up, it will check the system file for unfinished business and recreate
the queue as needed.
With disk space getting cheaper at a rapid rate, we are perhaps just
into the realm of the possible for the home computer.
It would be better though if every common enough user keyword could
be a perspective, instead restricting the number of perspectives to a small
preset number. Thus if a sufficient number of people are tagged as
“funny”, then there should be a perspective of funny people, and so forth
- intellectuals, homeowners – perhaps hundreds of single factor perspectives.
We would limit the number of such perspectives by having a threshold number
of participants who must share a perspective before the perspective is
activated.
Some of the more popular perspectives, like “woman” or “homeowner”
may have sub-perspectives, such as “single woman” or “Santa Monica homeowner”.
However, the number of perspectives grows exponentially if every combination
is going to be tracked, so multi-factor perspectives will have to be consciously
applied as such by the users to be considered.
Here, then, is an alternative plan for implementing selection with sort
on large numbers in a way that allows for many perspectives to be tracked:
Select first and then sort -- “SelectAndSort” we can call it.
In a way, this is the obvious method, except of course, a problem arises
when the selection is huge, as for instance, when no selection criteria
are used, so every item in the database is selected. Clearly the
selection must be limited in number if the sort is to be after selection,
but in that case, how do we guarantee that the highest rated items make
the cut?
A possible solution would be to have range keywords to use in the selection.
For instance we might divide the “Interest” range into hundreths, and look
first for items in the top percentile, then do a fresh search for those
in the second highest percentile and so forth until we have accumulated
a preset number of items to be sorted by average Interest.
If the number of items in the pool becomes very large, however, then
any scheme based on a set number of ranges, such as 100, will fail, because
too many items will always be selected. We need a method whereby
the number of items can increase indefinitely but in a way that is easy
to handle programmatically and that will keep the number of ranges from
becoming too large. One possible such solution is what I call “range
by halves”.
Here is an example of range by halves. Let's say the limit for
selection is 1000 items. That is we will select 1000 items max and
then sort them by rating. When the number of items in the pool becomes
1001, we need to split the pool into a top half and a bottom half, assigning
keywords that indicate which half an item falls into. When the top
half gets its own 1001st item, we split the pool into quarters, but this
time we assign new keywords only to the top quarter and the bottom quarter.
Then when the top quarter reaches 1001, we divide the top and bottom quarters,
assigning a keyword only for the top eighth and bottom eighth. Assuming
the pool is growing at a steady rate, the splits become rarer and rarer,
and each split requires only the same number of new keyword assignments
as the first split did – 500 items at the top and 500 items at the bottom
are assigned an additional, new range keyword and have the old range keyword
removed. Whenever there is a split, the old range keywords remain
accurate for all except the top and bottom 500.
Notice that after a split, items in the new ranges no longer have their
old keywords. Thus if the top quarter is split, the items in the
new top eights are keyed “top eighth”, but items in the 2nd from the top
eighth are keyed “top quarter”. To select the top quarter, one must
then specify “top eighth” OR “top quarter”. If one specifies only
“top quarter for a range that is split into eighths, then one actually
gets only the 2nd top eighth. A bit confusing.
In the range by halves method, the top and the bottom splits are coordinated.
When one splits, the other splits. New range delimiters are calculated
at split time, but existing keywords are not redone because that could
be too massive. Rather, existing range keywords are redone item by
item when new ratings for the item come in.
Interestingly, archiving can work with this scheme. Archiving
is the removal of items from the active database. First, of course,
old items that have no ratings are archived. If rated items need
to be archived, old items that are in the middle two quarters of the pool
will be archived first. The current set of range keywords is kept
for a pool of items that have been partly archived, but a new set of range
delimiters is calculated that will gradually reduce the numbers of items
in the more extreme high and low ranges as existing range keywords are
redone for each item that gets a new rating.
The advantage of this SelectAndSort with range by halves is that the
number of keyword records is greatly reduced while a large number of perspectives
can be accomodated. Each perspective will have its own set of range
by halves keywords that are based on the size of the pool of items for
that perspective. (An item is in the pool for a perspective if anyone
from that perspective has rated the item.)
One disadvantage of SelectAndSort with range by halves is that only
the top or bottom limit number of items (1000 in the example) can be listed
for a given keyword selection. There are ways around this limitation,
but they will be cumbersome. A second disadvantage is that building
the list will be slower than via MixDex, perhaps noticeably slower.
Once the list is built, then SelectAndSort with range by halves should
be fast, but typically people only want to see the first page of a selection,
so we build a list of 1000 to display a list of 40, and then build another
list of 1000 and so forth. This seems a serious drawback.
--------------------------------------------
Because there is more than one "SortBy" we have the possibility to
add an additional column to the MixDex file for each SortBy. One
column for Approval rating, another column for Interest, and yet a third
for Controversy. Or we can add a single column SortByType in MixDex.
Thus, option one, add a column for each SortBy would have MixDex records
like this:
MixDexRec
KeywordID
InterestSortBy
ApprovalSortBy
ControversySortBy
ItemID
Option two gives us
MixDexRec
KeywordID
SortByType (interest, approval, etc)
SortByValue
ItemID
For every KeywordID for an Item, option two will have three records,
while option one will have only one record. Also, because in option
two the KeywordID and ItemID has to be repeated in each record, and we
also have a SortByType column, option two will take quite a bit more space.
Nevertheless, I think we should go with option two because it is more
flexible. If we want to add a new SortBy, it will be relatively easy
to do so.
---------------
Rethinking the options.
There is an option three, and that is to have a separate secondary
index for each SortBy.
This gives us:
MixDexInterestRec
KeywordID
Interest
ItemID
MixDexControversyRec
KeywordID
Controversy
ItemID
and so forth. This has the advantage of smaller tables.
It is simple. However we do have to go change the code in multiple
places when we want to add a new SortByType and therefore makes it difficult
for us to give the user the ability to add new sorts. So I think
we still stick with option two for now.
If the speed problem with option two is too great, we may implement
both option one and option three, using a separate thread for each of the
SortBy MixDexBase tables. In any case, the work will be done in work
threads so we can return immediately to the requestor.
------------------
A point to realize: Berkeley DB btree is optimized to insert new records
in increasing sequence -- in other words, if the records are already sorted
by key, then bdb btree is more efficient. See http://peertech.org/BdbAndSmallRecords.
Here is a quote from a Sleepycat article: "As records are inserted and
pages in the B+tree fill up, they are split, with about half the keys going
into a new peer page at the same level in the tree. Most B+tree implementations
leave both nodes half-full after a split. This leads to poor performance
in a common case, where the caller inserts keys in order. To handle this
case, Berkeley DB keeps track of the insertion order, and splits pages
unevenly to keep pages fuller. This reduces tree size, yielding better
search performance and smaller databases." -- see http://www.sleepycat.com/docs/ref/refs/bdb_usenix.html |