cancel
Showing results for 
Search instead for 
Did you mean: 

How would you code this in Perl/PHP etc.?

Former Member
0 Kudos

Please note that if you submit an answer to this question, your code will be used in a NOT-FOR-PROFIT scientific research program, and you will be given credit for the code in any publication etc.

Problem Statement:

1. You have a string S of 20-80 characters over the alphabet A-Z (this string is drawn at random from a database DS of strings over A-Z.)

2. You have grouped the letters A-Z into the following three groups a, j, and t:

a = A-I j = J-S t = T-Z.

3. Within the string S, you identify all doublets (two consecutive letters) such that the first letter of the doublet is within group a or group j, and the second letter of the doublet is in group a or j.

4. Since you are only interested in these doublets within S, you represent S (for example) as the string S':

AKS . . . . . BL . . . . . . MD . . . . . EE (where each "." represents a letter in group t.)

where the "group-level" representation of the string S' is the string S":

ajj . . . . . aj . . . . . . ja . . . . . aa

5. You want to search the database DS and return all strings that have the form S" at the group-level.

6. Also, and MOST IMPORTANTLY, you want to return a string Sz from the database even if there is only a rough spacing correspondence between Sz and the template string S". The "allowable spacing difference" rule is that for a doublet Dz in Sz to match a doublet Ds" in S' (at the group-level), there must be no more than four characters between the end of Dz and the start of Ds", or vice-versa.

Edited by: David Halitsky on Jun 25, 2010 10:03 PM

Edited by: David Halitsky on Jun 25, 2010 10:05 PM

Accepted Solutions (0)

Answers (6)

Answers (6)

Former Member
0 Kudos

In the original definition, the "." character was described as a letter from the set T-Z. That has been carried forward as the sets were redefined to be characters from the input set (described as the alphabet originally) that did not positionally map as a portion of a doublet.

With that view, the matching of the doublet string was consistent between the search and db string.

In the clarification just added:

CY. . . ARL . . . . . . . . . . MT

with the expectation of finding matches on:

ll. . . blb . . . . . . . . . . bl (where "." stands for anything ...

The addition 'where "." stands for anything' makes the intent unclear. Since 'anything' can include mapped characters, the 'greedy' tendency of the matching alogorithm now comes into play. Say the string to check for matches is:

ll. . . blb . . . . . . . . blblbl

Using the expression default greedy tendency, the string that matches will be:

ll. . . blb . . . . . . . . blblbl

Disabling the greedy tendency, the string that matches will be:

ll. . . blb . . . . . . . . bl

Under a single expression, the following possible match string will not be detected:

ll. . . blb . . . . . . . . blbl

This will be a problem in any scenario where you have multiple matches allowed at a single offset point. Since the requirements desire all possible matches with overlaps, detection of all possible matches with multiple matches at an anchor point requires generation of a match string per possible combination based on proximity size otherwise only a single greedy tendency can be supported. With the example above, I think that corresponds to 72 patterns with a proximity of 4. This expands greatly as the number of blocks of non-significant match strings increase.

Can you clarify the matching rules you are looking for?

Thanks,

John Benson

former_member181923
Active Participant
0 Kudos

John -

Thanks for thinking so carefully about the consequences of each part of the "spec" - it's as important to this mini-project as your ability to effectively code what's ultimately decided on.

Regarding the choice between greedy and non-greedy, would it be possible to turn off greedy so that the effect is to find the nearest match that satisfies the supplied proximity parameter? (In your example, the returned string would be the one that ends in bl, not blblbl.) If so, this is definitely the way to go,

Regarding what should be allowed as ".", would it be possible to code the program so that as it searches, it ignores whatever satisfies a "." character in the search string?

Yes, this would allow the program to match a string s to a search string S even though S might have one or more "doublets of interest" that S does not have. But this is OK from an empirical perspective, because it might be the case that via evolution (mutation), S might have "lost" doublets of interest that s has maintained.

If the two rules suggested above (for the greedy issue and the singleton wildcard issue) are NOT sufficiently explicit, please let me know what outstanding questions need to be addressed inorder to make the rules sufficiently explicit.

Again, thanks so much for your "ounce of prevention" approach here.

Best

djh

Former Member
0 Kudos

Hi David,

The "." characters can be made to match any character as long as a greedy or non-greedy (minimal match size) preference is used. That restricts the matches to a maximum of 1 for any starting point. With that detemined, I need to change the generated match pattern to go for minimal match size. That should not take too much to implement.

I will let you know when I get a chance to do that.

former_member181923
Active Participant
0 Kudos

John -

You wrote:

"That restricts the matches to a maximum of 1 for any starting point"

That's perfectly OK for the purposes of this research tool, because overall, your program will be choosing multiple starting points.

Yes - please let us know in this thread when the final version is done, and we can make off-line arrangements for making the program available. I can make a new post in the Bioinformatics Wiki that I started last year, and point folks from this thread to that Wiki entry.

Again, words cannot express .... you know the rest ....

djh

former_member181923
Active Participant
0 Kudos

John -

You wrote:

"That restricts the matches to a maximum of 1 for any starting point"

That's perfectly OK for the purposes of this research tool, because overall, your program will be choosing multiple starting points.

Yes - please let us know in this thread when the final version is done, and we can make off-line arrangements for making the program available. I can make a new post in the Bioinformatics Wiki that I started last year, and point folks from this thread to that Wiki entry.

Again, words cannot express .... you know the rest ....

djh

former_member181923
Active Participant
0 Kudos

Also, John - let me know if you want it, and I will send you a ZIP of the full target DB - it's about 3500 entries, I think ... will also send along the "reduced" expansion rules that don't generate so many final doublets of interest ....

djh

Former Member
0 Kudos

Hi David,

The target DB would be helpful to spot check that the "greedy" settings do what I expect. The reduced expansion set would also be helpful. I could do a quick sanity check with those and five or so example search patterns. That should uncover any basic flaws in the algorithm.

Thanks,

John Benson

former_member181923
Active Participant
0 Kudos

John -

ZIP of target DB and reduced expansion rules comng at you shortly.

Regarding example search patterns, I have a suggestion that will reduce the chance of human error (by me) if I attemot to construct the example search patterns manually.

In particular, when you receive the target DB, could you write a one-off that simply translates all the sequences in this DB into sequences containing just "."s and doublets of interest?

That way, it would be easy for you to select search patterns of varying length and see if they find themselves, and then, by altering spacing in the selectd patterns, you could easily check your proximity logic.

I realize this is yet more code - believe me, I really do - but inasmuch as you've truly professionalized this effort so far, it would be great if you could find the time to write the one-off and thereby keep the example search pattern generation technique equally professional ...

Best

djh

Former Member
0 Kudos

I did a quick look at this and the expanded requirements appear to have a conflict with the example information. If the doublets allow an individual letter to compose part of two adjacent doublets, I would expect the representation for that letter would have to be consistent.

Take sequence RVT as a simple example. I would expect that to convert to lbl. This works fine with the samples based on

RV   => lb
 VT =>   bl

However, if I look at sequence DLK

DL  => ll
 LK =>  bl

This does not make sense as I read it because the L translates to l in the first doublet and b in the second doublet. This would work if the sequence expanded but you said there would not be an expansion due to the doublet mapping.

Can you explain how this should be handled?

Thanks,

former_member181923
Active Participant
0 Kudos

Sorry John - that was an idiot typo - DL should be "lb", not "ll".

I'll proofread for any other typos like that, but in case I miss one, here are the two rules for determining l and b:

b: FLIMVPAWG

l: STYHQNKDECR

In other words, your instinct was correct. The mapping from single letter to b or l is invariant ...

Again, my apologies ...

djh

Former Member
0 Kudos

I expanded the program and did a test run. Some sample results are as follows:

Pattern "Eval 1" found in "gi|d1hdj__ a.2.3.1| HSP40 {Human (Homo sapiens)}"

RAY MGKDYYQTLGLARGASDEEIKRAYRRQA

lbl bbllllllbbbblbbllllbllbllllb

lbl

match (1) at offset 19

match (2) at offset 22

Pattern "Eval 2" found in "gi|d1hdj__ a.2.3.1| HSP40 {Human (Homo sapiens)}"

MGKDYTY MGKDYYQTLGLARGASDEEIKRAYRRQA

bblllll bbllllllbbbblbbllllbllbllllb

bl match (1) at offset 1 Pattern "Eval 3" found in "gi|d1hdj__ a.2.3.1| HSP40 {Human (Homo sapiens)}" MGL MGKDYYQTLGLARGASDEEIKRAYRRQA bbb bbllllllbbbblbbllllbllbllllb b match (1) at offset 9 match (2) at offset 10 Pattern "Eval 4" found in "gi|d1hdj__ a.2.3.1| HSP40 {Human (Homo sapiens)}" APXAY MGKDYYQTLGLARGASDEEIKRAYRRQA bb,bl bbllllllbbbblbbllllbllbllllb b,{0,5}bl

match (1) at offset 10

Pattern "Eval 3" found in "gi|d1hdj__ a.2.3.2| HSP40 {Human (Homo sapiens)}"

MGL MGKDYYQTLGLARGASDEEIXWRYSKRAYRRQA

bbb bbllllllbbbblbbllllb,blllllbllllb

b

match (1) at offset 9

match (2) at offset 10

Pattern "Eval 4" found in "gi|d1hdj__ a.2.3.2| HSP40 {Human (Homo sapiens)}"

APXAY MGKDYYQTLGLARGASDEEIXWRYSKRAYRRQA

bb,bl bbllllllbbbblbbllllb,blllllbllllb

b,{0,5}bl

match (1) at offset 10

The output lines are as follows:

Line 1 Name of pattern to look for followed by the name of the DB line with a match

Line 2 Pattern to look for followed by the DB doublet source

Line 3 Doublets for pattern to look for followed by doublets for the pattern to search

Line 4 Search pattern determined with proximity allowance

Line 5+ Match number and offset in pattern where found (offset is actual position number in string)

I think this is consistent with the expected results but I am not 100% certail. Do you have any realistic DB lines and search patterns that should match? The original examples in this thread had sparsely populated significant doublets. Once I applied the doublet expansion, 399 of the 676 possible variations assuming full alphabet usage were then defined. This changed the characteristics to a higher density of significant match positions so test cases are harder to determine.

former_member181923
Active Participant
0 Kudos

Hi John -

Those results look good, at least for contiguous search patterns.

One clarifiicatiion and one question:

1) Clarification: there will only ever be 20 letters total, corresponding to the 20 core amino acids that occur in proteins. So, there will never be search patterns or DB strings with the letters B, J,O,U,X,Z. I don't think the program will have to be adjusted to take this into account, I just wanted you to be aware of why some letters are "missing". Also, the fact that the expansion rules produce 399 out of the 400 possible combinations of 20x20 letters means that I have to "cut back" the productivity of the expansion rules - I thought this might be the case and I do have a principled way to cut back the expansion rules so that far less than 399 final doublets are produced by them.

2) Question: Will the program as it stands now allow me to enter a non-contiguous search pattern like:


CY. . . ARL . . . . . . . . . . MT

with the expectation of finding matches on:


ll. . . blb . . . . . . . . . . bl  (where "." stands for anything ...

Finally, regarding testing of the PERL as it now stands - if you provide a final copy, I will test thoroughly against the real database with search strings of actual empirical interest, where I know the correct answers in advance.

Again, let me thank you very much again for what you've done here ... almost there. I think.

Best

djh

Former Member
0 Kudos

The ABAP won't paste into the editor correctly so I will send you an email copy.

former_member181923
Active Participant
0 Kudos

John u2013 First, let me thank you for what youu2019ve done here u2013 I cannot tell you how much itu2019s appreciated. Second, if I could lean on you for just a little more of your creativity, Iu2019d like to ask if youu2019d consider doing some revisions that will really add to the utility of what youu2019ve already done. These revisions do not involve the core of what youu2019ve done u2013 the final program will still try to match a search string to corresponding strings in a file, depending again on the position of certain doublets in the search string. What needs to change are: 1) the way in which u201Capprovedu201D doublets are initially identified and the way in which the group representation of these u201Capprovedu201D doublets is determined; 2) the way in which the input file is processed (the one that is searched against) 3) the way in which the matches are ouput. The best way to spec these changes is to propose the formal parametes for an actual Perl command line and then explain what processing should result from each of these parameters. The command line we need can be illustrated as follows:


matchem u2013core core.txt u2013xpnd xpnd.txt u2013prox n u2013targ db.txt u2013s u2018u2026.u2019
where: matchem is the name of the program core.txt is a file with an initial list of u201Capprovedu201D doublets and their group representation xpnd.txt is a file with instructions as to how to expand the initial u201Ccoreu201D list of approved doublets into a final list of u201Capprovedu201D doublets (this file may be empty, in which case u201Ccore.txtu201D is the final list of approved doublets) the n value of prox sets the u201Cproximity valueu201D for the match (in the original spec, this was 4) db.txt is the file of strings which the program will match the input string against. the u201C-su201Dstring is the search string u2013 the one the program will try to find matches for in db.txt. Details on the core.txt file Here is the u201Ccore.txtu201D file containing the initial list of u201Capprovedu201D doublets and their group representation:

FF bb
FL bb 
LL bb
FA bb
LK bl
LS bl
LT bl
LY bl
LG bb
IP bb
IA bb
ML bb
VL bb
VS bl
VT bl
VG bb
SL lb
SP lb
PV bb
PG bb
TL lb
TP lb
AL bb
AV bb
AP bb 
AN bl
AS bl
YP lb
HP lb
QR ll
KN ll
KK ll
DL ll
EN ll
EK ll
ER ll 
WR bl
RV lb
RY ll
RW lb
RR ll
SG lb
RR ll
GF bb
GL bb
GY bl
GK bl
GW bb
GR bl
GS bl

As you can see, the group representation of the u201Capprovedu201D doublets uses two letters u2013 a lower-case B and a lower-case L. Details on the xpnd.txt file Here is the xpnd.txt file:

F LIVPA
L FIMVPAW
I FLMVPAG
M LIVPAG
V FLIMPAG
P FLIMVAW
A FLIMVPG
W LPG
G IMVAW
S TYHQNKDECR
T SNKDER
Y SHQNDCR
H SYQNDCR
Q SYHKECR
N STYHKDECR
K STQNDER
D STYHNKECR
E STQNKDR
C SYHQNDR
R STYHQNKDEC
And here is an example of the way this file works in conjunction with the u201Ccore.txtu201D file to generate the final list of u201Capprovedu201D doublets. 1) VT is an approved doublet in the core.txt file 2) V is associated with the letters FLIMPAG in the xpnd.txt file 3) T is associated with the letters SNKDER in the xpnd.txt file 4) Therefore, new approved doublets should be generated by taking each element of with each element of , in that order (for example, FS, FN, FK, FD, FE, FR) 5) For each new doublet formed in this way, the group representation should be the same as that of the original doublet in the core.text file (in this case, the group representation for each new doublet would be u201Cblu201D, since u201Cblu201D is the group representation of the original doublet VT 6) Note that the final u201Capprovedu201D doublet list should be u201Cde-duplicatedu201D so that no doublet appears more than once. Details on db.txt file Each string in the db.txt file will be on one line of the file, and each string will be preceded by a u201Cstring idu201D line with information about the string itself. Like this:

gi|d1hdj__ a.2.3.1|(-) HSP40 {Human (Homo sapiens)}
MGKDYYQTLGLARGASDEEIKRAYRRQALRYHPDKNKEPGAEEKFKEIAEAYDVLSDPRKREIFDRYGEEGLKGSGC
Use of the u201Cstring idsu201D in the output file of matches The u201Cstring idu201D for each string in the db.txt file will be used in the output of the program u2013 in particular, when matches are reported in the output of the program, each match must be preceded (on the same line) by the sequence ID of the string in which the match was found. For example:

gi|d1hdj__ a.2.3.1|(-) HSP40 {Human (Homo sapiens)} MGKDYYQTLGLARGASDEEIKRAYRRQA
Multiple matches within the same string in the db.txt file We want to look for ALL matches to a given search string within each string of the db.txt file. For shorter search strings, there may be many matches within a single string in the db.txt file. Again, many many thanks, John ...

Former Member
0 Kudos

There are a couple ways to view the proximity of the doublets. For this response I picked proximity as being relative to the current matching doublet and not to the absolute position in the string.

Due to the nature of the matching, my solution is based upon creating the doublet pattern for both the search string and the strings to be evaluated. The pattern for the search string is then converted to an expression that gives the allowances for the position variation. The example perl is:

#!/usr/bin/perl  
$starting = shift;
$search = create_doublet( $starting ); 
$search = create_match( $search ); 

print "Input string\n     $starting\n     $search\n\n";

$infile = shift;  
open( INFILE, $infile );

while (<INFILE>) {    
  chomp;
  my $line = $_;
  my $str = create_doublet( $line );
  print "check  $line\n";

  if( $str =~ /$search/ ) {
    print "Match\n\n";
  } else {
    print "No Match\n\n";
  }

}
close( INFILE );

sub create_doublet {
  $new = shift;
  $new =~ s/[A-I]/A/g;
  $new =~ s/[J-S]/J/g;
  $new =~ s/[T-Z]/t/g;
  $new =~ s/^Jt/tt/g;
  $new =~ s/tJ\s*$/tt/g;
  $new =~ s/^At/tt/g;
  $new =~ s/tA\s*$/tt/g;
  $new =~ s/tAt/ttt/g;
  $new =~ s/tJt/ttt/g;
  return $new;
}

sub create_match {
  my $pattern;
  $_ = shift;

  @fields = /A+|J+|t+/g;
  foreach (@fields) {
    $fld = $_;
    $char = substr( $fld, 0, 1);
    $pattern .= $char;
    $cnt = length "$fld";
    if( $fld =~ /t/ ) {
      $low = $cnt - 3;
      $high = $cnt + 3;
      if( $low < 0 ) {
        $low = 0;
      }
      $pattern .= "{" . "$low" . "," . "$high" . "}";
    } else {
      $pattern .= "{" . "$cnt" . "}";
    }

  }

  if( $pattern =~ /^(A|J)/ ) {
    $pattern = "t{0,4}" . $pattern;
  }
  if( $pattern =~ /(A|J)\{[0-9]+\}\s*$/ ) {
    $pattern .= "t{0,4}";
  }


  return $pattern;
  
}

When this is run using an input file as follows:

AASTQWIULKONHBGDUXXIOL
TTQUIEIWOKDIJNNIUJEJHSXXZZRUEIURH
WVWAASTQWXXIUKONHBDUXXIOL
AASTQWIUKIONHBDUXXIOLWWX
AASTQWIUKONHBDUXXIOLWWX
AASVTQWVTIUKONHBDUXXWIOL
AASTQWIUKONHBDUXXIOL

The result is:

Input string
     AASTQWIUKONHBDUXXIOL
     t{0,4}A{2}J{1}t{2,8}J{3}A{3}t{0,6}A{1}J{2}t{0,4}

check  AASTQWIULKONHBGDUXXIOL
No Match

check  TTQUIEIWOKDIJNNIUJEJHSXXZZRUEIURH
No Match

check  WVWAASTQWXXIUKONHBDUXXIOL
Match

check  AASTQWIUKIONHBDUXXIOLWWX
No Match

check  AASTQWIUKONHBDUXXIOLWWX
Match

check  AASVTQWVTIUKONHBDUXXWIOL
Match

check  AASTQWIUKONHBDUXXIOL
Match

Check the next post for an ABAP style solution.

Former Member
0 Kudos

Hi David,

I took a quick look at the pattern and have a question about if the existance of a single "a" or "j" value disqualifies a string. In item (4), the non doublet characters in the note to the side are group "t".

Am I reading this correctly that the existance of a single (cannot be consumed in a doublet) "a" or "j" disqualifies a string from being able to create a group level representation?

Second question. I am assuming the allowable difference also includes leading or trailing sequence of "t" characters. Is that correct?

The way I am reading this requirement, it appears that most randomly generated strings would fail to be valid to generate a group level representation because any occurrance of an an "a" or "j" bounded only by "t", start of string (^) , or end of string ($) would fail to meet the intervening "t" only character requirement.

If single "a" and "j" characters do not disqualify a string from having a group level representation, a little clarification on the problem description is needed.

Last question. A search string consisting of only characters from "t" would generate a group level representations with no doublets. Is this also a valid search string? If yes, does the allowable difference come into play concerning the length of the strings or is the allowable difference only considered with a doublet?

former_member181923
Active Participant
0 Kudos

Hi John -

Thanks for pointing out some serious sins of omission in the spec. To answer your questions:

Q1:

Am I reading this correctly that the existance of a single (cannot be consumed in a doublet) "a" or "j" disqualifies a string from being able to create a group level representation?

A1:

A letter from group a or j that does not occur next to a letter from a or j should be counted as if it were a letter in the group t. That is, in the group-level representation of the string, the letter should be mapped as a "."

Q2:

Second question. I am assuming the allowable difference also includes leading or trailing sequence of "t" characters. Is that correct?

A2:

Yes - including "pseudo-t's" that derive from "singleton" a's or j's in the sense of Q1/A1.

You also wrote the following:

"The way I am reading this requirement, it appears that most randomly generated strings would fail to be valid to generate a group level representation because any occurrance of an an "a" or "j" bounded only by "t", start of string (^) , or end of string ($) would fail to meet the intervening "t" only character requirement.

If single "a" and "j" characters do not disqualify a string from having a group level representation, a little clarification on the problem description is needed. "

I think my answers to Q1 and Q2 provide the necessary clarification here - if not, please point out any remaining issues.

Also, you posed Q3:

Last question. A search string consisting of only characters from "t" would generate a group level representations with no doublets. Is this also a valid search string? If yes, does the allowable difference come into play concerning the length of the strings or is the allowable difference only considered with a doublet?

A3:

Sorry again! For the sake of this discussion, a search string S is valid only if the number of a and j letters in doublets is at least 10% of the entire length of S.

Thanks very much again for taking the time you have with this, John.

Best

djh

former_member181923
Active Participant
0 Kudos

Sorry for the all the stupid typos in the original spec - I've now corrected them all so the spec is at least readable ...