on 06-25-2010 8:27 PM
Please note that if you submit an answer to this question, your code will be used in a NOT-FOR-PROFIT scientific research program, and you will be given credit for the code in any publication etc.
Problem Statement:
1. You have a string S of 20-80 characters over the alphabet A-Z (this string is drawn at random from a database DS of strings over A-Z.)
2. You have grouped the letters A-Z into the following three groups a, j, and t:
a = A-I j = J-S t = T-Z.
3. Within the string S, you identify all doublets (two consecutive letters) such that the first letter of the doublet is within group a or group j, and the second letter of the doublet is in group a or j.
4. Since you are only interested in these doublets within S, you represent S (for example) as the string S':
AKS . . . . . BL . . . . . . MD . . . . . EE (where each "." represents a letter in group t.)
where the "group-level" representation of the string S' is the string S":
ajj . . . . . aj . . . . . . ja . . . . . aa
5. You want to search the database DS and return all strings that have the form S" at the group-level.
6. Also, and MOST IMPORTANTLY, you want to return a string Sz from the database even if there is only a rough spacing correspondence between Sz and the template string S". The "allowable spacing difference" rule is that for a doublet Dz in Sz to match a doublet Ds" in S' (at the group-level), there must be no more than four characters between the end of Dz and the start of Ds", or vice-versa.
Edited by: David Halitsky on Jun 25, 2010 10:03 PM
Edited by: David Halitsky on Jun 25, 2010 10:05 PM
In the original definition, the "." character was described as a letter from the set T-Z. That has been carried forward as the sets were redefined to be characters from the input set (described as the alphabet originally) that did not positionally map as a portion of a doublet.
With that view, the matching of the doublet string was consistent between the search and db string.
In the clarification just added:
CY. . . ARL . . . . . . . . . . MT
with the expectation of finding matches on:
ll. . . blb . . . . . . . . . . bl (where "." stands for anything ...
The addition 'where "." stands for anything' makes the intent unclear. Since 'anything' can include mapped characters, the 'greedy' tendency of the matching alogorithm now comes into play. Say the string to check for matches is:
ll. . . blb . . . . . . . . blblbl
Using the expression default greedy tendency, the string that matches will be:
ll. . . blb . . . . . . . . blblbl
Disabling the greedy tendency, the string that matches will be:
ll. . . blb . . . . . . . . bl
Under a single expression, the following possible match string will not be detected:
ll. . . blb . . . . . . . . blbl
This will be a problem in any scenario where you have multiple matches allowed at a single offset point. Since the requirements desire all possible matches with overlaps, detection of all possible matches with multiple matches at an anchor point requires generation of a match string per possible combination based on proximity size otherwise only a single greedy tendency can be supported. With the example above, I think that corresponds to 72 patterns with a proximity of 4. This expands greatly as the number of blocks of non-significant match strings increase.
Can you clarify the matching rules you are looking for?
Thanks,
John Benson
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
John -
Thanks for thinking so carefully about the consequences of each part of the "spec" - it's as important to this mini-project as your ability to effectively code what's ultimately decided on.
Regarding the choice between greedy and non-greedy, would it be possible to turn off greedy so that the effect is to find the nearest match that satisfies the supplied proximity parameter? (In your example, the returned string would be the one that ends in bl, not blblbl.) If so, this is definitely the way to go,
Regarding what should be allowed as ".", would it be possible to code the program so that as it searches, it ignores whatever satisfies a "." character in the search string?
Yes, this would allow the program to match a string s to a search string S even though S might have one or more "doublets of interest" that S does not have. But this is OK from an empirical perspective, because it might be the case that via evolution (mutation), S might have "lost" doublets of interest that s has maintained.
If the two rules suggested above (for the greedy issue and the singleton wildcard issue) are NOT sufficiently explicit, please let me know what outstanding questions need to be addressed inorder to make the rules sufficiently explicit.
Again, thanks so much for your "ounce of prevention" approach here.
Best
djh
Hi David,
The "." characters can be made to match any character as long as a greedy or non-greedy (minimal match size) preference is used. That restricts the matches to a maximum of 1 for any starting point. With that detemined, I need to change the generated match pattern to go for minimal match size. That should not take too much to implement.
I will let you know when I get a chance to do that.
John -
You wrote:
"That restricts the matches to a maximum of 1 for any starting point"
That's perfectly OK for the purposes of this research tool, because overall, your program will be choosing multiple starting points.
Yes - please let us know in this thread when the final version is done, and we can make off-line arrangements for making the program available. I can make a new post in the Bioinformatics Wiki that I started last year, and point folks from this thread to that Wiki entry.
Again, words cannot express .... you know the rest ....
djh
John -
You wrote:
"That restricts the matches to a maximum of 1 for any starting point"
That's perfectly OK for the purposes of this research tool, because overall, your program will be choosing multiple starting points.
Yes - please let us know in this thread when the final version is done, and we can make off-line arrangements for making the program available. I can make a new post in the Bioinformatics Wiki that I started last year, and point folks from this thread to that Wiki entry.
Again, words cannot express .... you know the rest ....
djh
Hi David,
The target DB would be helpful to spot check that the "greedy" settings do what I expect. The reduced expansion set would also be helpful. I could do a quick sanity check with those and five or so example search patterns. That should uncover any basic flaws in the algorithm.
Thanks,
John Benson
John -
ZIP of target DB and reduced expansion rules comng at you shortly.
Regarding example search patterns, I have a suggestion that will reduce the chance of human error (by me) if I attemot to construct the example search patterns manually.
In particular, when you receive the target DB, could you write a one-off that simply translates all the sequences in this DB into sequences containing just "."s and doublets of interest?
That way, it would be easy for you to select search patterns of varying length and see if they find themselves, and then, by altering spacing in the selectd patterns, you could easily check your proximity logic.
I realize this is yet more code - believe me, I really do - but inasmuch as you've truly professionalized this effort so far, it would be great if you could find the time to write the one-off and thereby keep the example search pattern generation technique equally professional ...
Best
djh
I did a quick look at this and the expanded requirements appear to have a conflict with the example information. If the doublets allow an individual letter to compose part of two adjacent doublets, I would expect the representation for that letter would have to be consistent.
Take sequence RVT as a simple example. I would expect that to convert to lbl. This works fine with the samples based on
RV => lb
VT => bl
However, if I look at sequence DLK
DL => ll
LK => bl
This does not make sense as I read it because the L translates to l in the first doublet and b in the second doublet. This would work if the sequence expanded but you said there would not be an expansion due to the doublet mapping.
Can you explain how this should be handled?
Thanks,
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Sorry John - that was an idiot typo - DL should be "lb", not "ll".
I'll proofread for any other typos like that, but in case I miss one, here are the two rules for determining l and b:
b: FLIMVPAWG
l: STYHQNKDECR
In other words, your instinct was correct. The mapping from single letter to b or l is invariant ...
Again, my apologies ...
djh
I expanded the program and did a test run. Some sample results are as follows:
Pattern "Eval 1" found in "gi|d1hdj__ a.2.3.1| HSP40 {Human (Homo sapiens)}"
RAY MGKDYYQTLGLARGASDEEIKRAYRRQA
lbl bbllllllbbbblbbllllbllbllllb
lbl
match (1) at offset 19
match (2) at offset 22
Pattern "Eval 2" found in "gi|d1hdj__ a.2.3.1| HSP40 {Human (Homo sapiens)}"
MGKDYTY MGKDYYQTLGLARGASDEEIKRAYRRQA
bblllll bbllllllbbbblbbllllbllbllllb
bl match (1) at offset 1 Pattern "Eval 3" found in "gi|d1hdj__ a.2.3.1| HSP40 {Human (Homo sapiens)}" MGL MGKDYYQTLGLARGASDEEIKRAYRRQA bbb bbllllllbbbblbbllllbllbllllb b match (1) at offset 9 match (2) at offset 10 Pattern "Eval 4" found in "gi|d1hdj__ a.2.3.1| HSP40 {Human (Homo sapiens)}" APXAY MGKDYYQTLGLARGASDEEIKRAYRRQA bb,bl bbllllllbbbblbbllllbllbllllb b,{0,5}bl
match (1) at offset 10
Pattern "Eval 3" found in "gi|d1hdj__ a.2.3.2| HSP40 {Human (Homo sapiens)}"
MGL MGKDYYQTLGLARGASDEEIXWRYSKRAYRRQA
bbb bbllllllbbbblbbllllb,blllllbllllb
b
match (1) at offset 9
match (2) at offset 10
Pattern "Eval 4" found in "gi|d1hdj__ a.2.3.2| HSP40 {Human (Homo sapiens)}"
APXAY MGKDYYQTLGLARGASDEEIXWRYSKRAYRRQA
bb,bl bbllllllbbbblbbllllb,blllllbllllb
b,{0,5}bl
match (1) at offset 10
The output lines are as follows:
Line 1 Name of pattern to look for followed by the name of the DB line with a match
Line 2 Pattern to look for followed by the DB doublet source
Line 3 Doublets for pattern to look for followed by doublets for the pattern to search
Line 4 Search pattern determined with proximity allowance
Line 5+ Match number and offset in pattern where found (offset is actual position number in string)
I think this is consistent with the expected results but I am not 100% certail. Do you have any realistic DB lines and search patterns that should match? The original examples in this thread had sparsely populated significant doublets. Once I applied the doublet expansion, 399 of the 676 possible variations assuming full alphabet usage were then defined. This changed the characteristics to a higher density of significant match positions so test cases are harder to determine.
Hi John -
Those results look good, at least for contiguous search patterns.
One clarifiicatiion and one question:
1) Clarification: there will only ever be 20 letters total, corresponding to the 20 core amino acids that occur in proteins. So, there will never be search patterns or DB strings with the letters B, J,O,U,X,Z. I don't think the program will have to be adjusted to take this into account, I just wanted you to be aware of why some letters are "missing". Also, the fact that the expansion rules produce 399 out of the 400 possible combinations of 20x20 letters means that I have to "cut back" the productivity of the expansion rules - I thought this might be the case and I do have a principled way to cut back the expansion rules so that far less than 399 final doublets are produced by them.
2) Question: Will the program as it stands now allow me to enter a non-contiguous search pattern like:
CY. . . ARL . . . . . . . . . . MT
with the expectation of finding matches on:
ll. . . blb . . . . . . . . . . bl (where "." stands for anything ...
Finally, regarding testing of the PERL as it now stands - if you provide a final copy, I will test thoroughly against the real database with search strings of actual empirical interest, where I know the correct answers in advance.
Again, let me thank you very much again for what you've done here ... almost there. I think.
Best
djh
The ABAP won't paste into the editor correctly so I will send you an email copy.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
John u2013
First, let me thank you for what youu2019ve done here u2013 I cannot tell you how much itu2019s appreciated.
Second, if I could lean on you for just a little more of your creativity, Iu2019d like to ask if youu2019d consider doing some revisions that will really add to the utility of what youu2019ve already done. These revisions do not involve the core of what youu2019ve done u2013 the final program will still try to match a search string to corresponding strings in a file, depending again on the position of certain doublets in the search string.
What needs to change are:
1) the way in which u201Capprovedu201D doublets are initially identified and the way in which the group representation of these u201Capprovedu201D doublets is determined;
2) the way in which the input file is processed (the one that is searched against)
3) the way in which the matches are ouput.
The best way to spec these changes is to propose the formal parametes for an actual Perl command line and then explain what processing should result from each of these parameters.
The command line we need can be illustrated as follows:
where:
matchem is the name of the program
core.txt is a file with an initial list of u201Capprovedu201D doublets and their group representation
xpnd.txt is a file with instructions as to how to expand the initial u201Ccoreu201D list of approved doublets into a final list of u201Capprovedu201D
doublets (this file may be empty, in which case u201Ccore.txtu201D is the final list of approved doublets)
the n value of prox sets the u201Cproximity valueu201D for the match (in the original spec, this was 4)
db.txt is the file of strings which the program will match the input string against.
the u201C-su201Dstring is the search string u2013 the one the program will try to find matches for in db.txt.
Details on the core.txt file
Here is the u201Ccore.txtu201D file containing the initial list of u201Capprovedu201D doublets and their group representation:
matchem u2013core core.txt u2013xpnd xpnd.txt u2013prox n u2013targ db.txt u2013s u2018u2026.u2019
As you can see, the group representation of the u201Capprovedu201D doublets uses two letters u2013 a lower-case B and a lower-case L.
Details on the xpnd.txt file
Here is the xpnd.txt file:
FF bb
FL bb
LL bb
FA bb
LK bl
LS bl
LT bl
LY bl
LG bb
IP bb
IA bb
ML bb
VL bb
VS bl
VT bl
VG bb
SL lb
SP lb
PV bb
PG bb
TL lb
TP lb
AL bb
AV bb
AP bb
AN bl
AS bl
YP lb
HP lb
QR ll
KN ll
KK ll
DL ll
EN ll
EK ll
ER ll
WR bl
RV lb
RY ll
RW lb
RR ll
SG lb
RR ll
GF bb
GL bb
GY bl
GK bl
GW bb
GR bl
GS bl
And here is an example of the way this file works in conjunction with the u201Ccore.txtu201D file to generate the final list of u201Capprovedu201D doublets.
1) VT is an approved doublet in the core.txt file
2) V is associated with the letters FLIMPAG in the xpnd.txt file
3) T is associated with the letters SNKDER in the xpnd.txt file
4) Therefore, new approved doublets should be generated by taking each element of with each element of , in that order (for example, FS, FN, FK, FD, FE, FR)
5) For each new doublet formed in this way, the group representation should be the same as that of the original doublet in the core.text file (in this case, the group representation for each new doublet would be u201Cblu201D, since u201Cblu201D is the group representation of the original doublet VT
6) Note that the final u201Capprovedu201D doublet list should be u201Cde-duplicatedu201D so that no doublet appears more than once.
Details on db.txt file
Each string in the db.txt file will be on one line of the file, and each string will be preceded by a u201Cstring idu201D line with information about the string itself. Like this:
F LIVPA
L FIMVPAW
I FLMVPAG
M LIVPAG
V FLIMPAG
P FLIMVAW
A FLIMVPG
W LPG
G IMVAW
S TYHQNKDECR
T SNKDER
Y SHQNDCR
H SYQNDCR
Q SYHKECR
N STYHKDECR
K STQNDER
D STYHNKECR
E STQNKDR
C SYHQNDR
R STYHQNKDEC
Use of the u201Cstring idsu201D in the output file of matches
The u201Cstring idu201D for each string in the db.txt file will be used in the output of the program u2013 in particular, when matches are reported in the output of the program, each match must be preceded (on the same line) by the sequence ID of the string in which the match was found. For example:
gi|d1hdj__ a.2.3.1|(-) HSP40 {Human (Homo sapiens)}
MGKDYYQTLGLARGASDEEIKRAYRRQALRYHPDKNKEPGAEEKFKEIAEAYDVLSDPRKREIFDRYGEEGLKGSGC
Multiple matches within the same string in the db.txt file
We want to look for ALL matches to a given search string within each string of the db.txt file. For shorter search strings, there may be many matches within a single string in the db.txt file.
Again, many many thanks, John ...
gi|d1hdj__ a.2.3.1|(-) HSP40 {Human (Homo sapiens)} MGKDYYQTLGLARGASDEEIKRAYRRQA
There are a couple ways to view the proximity of the doublets. For this response I picked proximity as being relative to the current matching doublet and not to the absolute position in the string.
Due to the nature of the matching, my solution is based upon creating the doublet pattern for both the search string and the strings to be evaluated. The pattern for the search string is then converted to an expression that gives the allowances for the position variation. The example perl is:
#!/usr/bin/perl
$starting = shift;
$search = create_doublet( $starting );
$search = create_match( $search );
print "Input string\n $starting\n $search\n\n";
$infile = shift;
open( INFILE, $infile );
while (<INFILE>) {
chomp;
my $line = $_;
my $str = create_doublet( $line );
print "check $line\n";
if( $str =~ /$search/ ) {
print "Match\n\n";
} else {
print "No Match\n\n";
}
}
close( INFILE );
sub create_doublet {
$new = shift;
$new =~ s/[A-I]/A/g;
$new =~ s/[J-S]/J/g;
$new =~ s/[T-Z]/t/g;
$new =~ s/^Jt/tt/g;
$new =~ s/tJ\s*$/tt/g;
$new =~ s/^At/tt/g;
$new =~ s/tA\s*$/tt/g;
$new =~ s/tAt/ttt/g;
$new =~ s/tJt/ttt/g;
return $new;
}
sub create_match {
my $pattern;
$_ = shift;
@fields = /A+|J+|t+/g;
foreach (@fields) {
$fld = $_;
$char = substr( $fld, 0, 1);
$pattern .= $char;
$cnt = length "$fld";
if( $fld =~ /t/ ) {
$low = $cnt - 3;
$high = $cnt + 3;
if( $low < 0 ) {
$low = 0;
}
$pattern .= "{" . "$low" . "," . "$high" . "}";
} else {
$pattern .= "{" . "$cnt" . "}";
}
}
if( $pattern =~ /^(A|J)/ ) {
$pattern = "t{0,4}" . $pattern;
}
if( $pattern =~ /(A|J)\{[0-9]+\}\s*$/ ) {
$pattern .= "t{0,4}";
}
return $pattern;
}
When this is run using an input file as follows:
AASTQWIULKONHBGDUXXIOL
TTQUIEIWOKDIJNNIUJEJHSXXZZRUEIURH
WVWAASTQWXXIUKONHBDUXXIOL
AASTQWIUKIONHBDUXXIOLWWX
AASTQWIUKONHBDUXXIOLWWX
AASVTQWVTIUKONHBDUXXWIOL
AASTQWIUKONHBDUXXIOL
The result is:
Input string
AASTQWIUKONHBDUXXIOL
t{0,4}A{2}J{1}t{2,8}J{3}A{3}t{0,6}A{1}J{2}t{0,4}
check AASTQWIULKONHBGDUXXIOL
No Match
check TTQUIEIWOKDIJNNIUJEJHSXXZZRUEIURH
No Match
check WVWAASTQWXXIUKONHBDUXXIOL
Match
check AASTQWIUKIONHBDUXXIOLWWX
No Match
check AASTQWIUKONHBDUXXIOLWWX
Match
check AASVTQWVTIUKONHBDUXXWIOL
Match
check AASTQWIUKONHBDUXXIOL
Match
Check the next post for an ABAP style solution.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Hi David,
I took a quick look at the pattern and have a question about if the existance of a single "a" or "j" value disqualifies a string. In item (4), the non doublet characters in the note to the side are group "t".
Am I reading this correctly that the existance of a single (cannot be consumed in a doublet) "a" or "j" disqualifies a string from being able to create a group level representation?
Second question. I am assuming the allowable difference also includes leading or trailing sequence of "t" characters. Is that correct?
The way I am reading this requirement, it appears that most randomly generated strings would fail to be valid to generate a group level representation because any occurrance of an an "a" or "j" bounded only by "t", start of string (^) , or end of string ($) would fail to meet the intervening "t" only character requirement.
If single "a" and "j" characters do not disqualify a string from having a group level representation, a little clarification on the problem description is needed.
Last question. A search string consisting of only characters from "t" would generate a group level representations with no doublets. Is this also a valid search string? If yes, does the allowable difference come into play concerning the length of the strings or is the allowable difference only considered with a doublet?
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
Hi John -
Thanks for pointing out some serious sins of omission in the spec. To answer your questions:
Q1:
Am I reading this correctly that the existance of a single (cannot be consumed in a doublet) "a" or "j" disqualifies a string from being able to create a group level representation?
A1:
A letter from group a or j that does not occur next to a letter from a or j should be counted as if it were a letter in the group t. That is, in the group-level representation of the string, the letter should be mapped as a "."
Q2:
Second question. I am assuming the allowable difference also includes leading or trailing sequence of "t" characters. Is that correct?
A2:
Yes - including "pseudo-t's" that derive from "singleton" a's or j's in the sense of Q1/A1.
You also wrote the following:
"The way I am reading this requirement, it appears that most randomly generated strings would fail to be valid to generate a group level representation because any occurrance of an an "a" or "j" bounded only by "t", start of string (^) , or end of string ($) would fail to meet the intervening "t" only character requirement.
If single "a" and "j" characters do not disqualify a string from having a group level representation, a little clarification on the problem description is needed. "
I think my answers to Q1 and Q2 provide the necessary clarification here - if not, please point out any remaining issues.
Also, you posed Q3:
Last question. A search string consisting of only characters from "t" would generate a group level representations with no doublets. Is this also a valid search string? If yes, does the allowable difference come into play concerning the length of the strings or is the allowable difference only considered with a doublet?
A3:
Sorry again! For the sake of this discussion, a search string S is valid only if the number of a and j letters in doublets is at least 10% of the entire length of S.
Thanks very much again for taking the time you have with this, John.
Best
djh
Sorry for the all the stupid typos in the original spec - I've now corrected them all so the spec is at least readable ...
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.