How would you code this in Perl/PHP etc.?

Former Member · ‎06-25-2010

Please note that if you submit an answer to this question, your code will be used in a NOT-FOR-PROFIT scientific research program, and you will be given credit for the code in any publication etc.

Problem Statement:

1. You have a string S of 20-80 characters over the alphabet A-Z (this string is drawn at random from a database DS of strings over A-Z.)

2. You have grouped the letters A-Z into the following three groups a, j, and t:

a = A-I, j = J-S, t = T-Z.

3. Within the string S, you identify all doublets (two consecutive letters) such that the first letter of the doublet is within group a or group j, and the second letter of the doublet is in group a or j.

4. Since you are only interested in these doublets within S, you represent S (for example) as the string S':

AKS . . . . . BL . . . . . . MD . . . . . EE (where each "." represents a letter in group t.)

where the "group-level" representation of the string S' is the string S":

ajj . . . . . aj . . . . . . ja . . . . . aa

5. You want to search the database DS and return all strings that have the form S" at the group-level.

6. Also, and MOST IMPORTANTLY, you want to return a string Sz from the database even if there is only a rough spacing correspondence between Sz and the template string S". The "allowable spacing difference" rule is that for a doublet Dz in Sz to match a doublet Ds" in S" (at the group-level), there must be no more than four characters between the end of Dz and the start of Ds", or vice-versa.
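Steps 2 through 4 above can be sketched in a few lines of Perl. This is only an illustration, not part of the original spec: the helper name group_level and the sample string are made up, and the sketch follows step 4 literally by masking only group t letters.

```perl
use strict;
use warnings;

# Hypothetical helper: return (S', S") for a string S, where every
# letter in group t (T-Z) is shown as "." and the remaining letters
# are mapped to their group names a (A-I) and j (J-S).
sub group_level {
    my ($s) = @_;
    ( my $masked = $s ) =~ tr/T-Z/./;      # S': group t letters become "."
    ( my $groups = $masked ) =~ tr/A-I/a/; # S": A-I => a
    $groups =~ tr/J-S/j/;                  #     J-S => j
    return ( $masked, $groups );
}

# Sample string chosen to reproduce the S' and S" lines in step 4.
my ( $s1, $s2 ) = group_level('AKSTUVWXBLTUVWXYMDTUVWXEE');
print "$s1\n$s2\n";
```

A lone a or j letter surrounded by t letters is left as-is here; whether such letters should also be treated as "." is not settled by the spec.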

Edited by: David Halitsky on Jun 25, 2010 10:03 PM

Edited by: David Halitsky on Jun 25, 2010 10:05 PM

Former Member · ‎07-06-2010

In the original definition, the "." character was described as a letter from the set T-Z. That was carried forward when the sets were redefined: "." now denotes any character from the input set (originally described as the alphabet) that does not positionally map to part of a doublet.

With that view, the matching of the doublet string was consistent between the search and db string.

In the clarification just added:

CY. . . ARL . . . . . . . . . . MT

with the expectation of finding matches on:

ll. . . blb . . . . . . . . . . bl (where "." stands for anything ...

The addition 'where "." stands for anything' makes the intent unclear. Since 'anything' can include mapped characters, the 'greedy' tendency of the matching algorithm now comes into play. Say the string to check for matches is:

ll. . . blb . . . . . . . . blblbl

Using the expression's default greedy tendency, the string that matches will be:

ll. . . blb . . . . . . . . blblbl

Disabling the greedy tendency, the string that matches will be:

ll. . . blb . . . . . . . . bl

Under a single expression, the following possible match string will not be detected:

ll. . . blb . . . . . . . . blbl

This will be a problem in any scenario where multiple matches are allowed at a single offset point. Since the requirements call for all possible matches, including overlaps, detecting every match at an anchor point requires generating one match pattern per possible combination of gap sizes within the proximity limit; otherwise a single expression can support only one tendency (greedy or non-greedy). With the example above, I think that corresponds to 72 patterns with a proximity of 4. The number grows quickly as the number of blocks of non-significant characters increases.
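To make the greedy issue concrete, here is a minimal Perl sketch. The sample string uses "x" for non-significant characters, and the gap bounds (0 to 7 and 4 to 12) and the helper all_match_ends are invented for the illustration; they are not taken from the requirements.

```perl
use strict;
use warnings;

# 'x' stands for a non-significant character; "." in the patterns
# matches anything, including mapped characters.
my $s = 'llxxxblbxxxxxxxxblblbl';

# Hypothetical helper: try each exact length for the second gap in
# turn and collect the end offset of every match that results.
sub all_match_ends {
    my ( $str, $min, $max ) = @_;
    my @ends;
    for my $n ( $min .. $max ) {
        push @ends, $+[0] if $str =~ /^ll.{0,7}blb.{$n}bl/;
    }
    return @ends;
}

# A single expression yields only one of the candidates.
my ($greedy) = $s =~ /^(ll.{0,7}blb.{4,12}bl)/;    # longest
my ($lazy)   = $s =~ /^(ll.{0,7}?blb.{4,12}?bl)/;  # shortest

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "all end offsets: @{[ all_match_ends( $s, 4, 12 ) ]}\n";
```

The greedy form returns the whole string and the lazy form stops at the first trailing "bl"; only the loop over fixed gap lengths also finds the match ending at the middle "blbl", which is the one-pattern-per-combination cost described above.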

Can you clarify the matching rules you are looking for?

Thanks,

John Benson

Former Member · ‎06-30-2010

I did a quick look at this and the expanded requirements appear to have a conflict with the example information. If the doublets allow an individual letter to compose part of two adjacent doublets, I would expect the representation for that letter would have to be consistent.

Take sequence RVT as a simple example. I would expect that to convert to lbl. This works fine with the samples based on

RV   => lb
 VT =>   bl

However, if I look at sequence DLK

DL  => ll
 LK =>  bl

This does not make sense as I read it because the L translates to l in the first doublet and b in the second doublet. This would work if the sequence expanded but you said there would not be an expansion due to the doublet mapping.

Can you explain how this should be handled?

Thanks,

Former Member · ‎06-29-2010

The ABAP won't paste into the editor correctly so I will send you an email copy.

Former Member · ‎06-29-2010

There are a couple ways to view the proximity of the doublets. For this response I picked proximity as being relative to the current matching doublet and not to the absolute position in the string.

Due to the nature of the matching, my solution is based upon creating the doublet pattern for both the search string and the strings to be evaluated. The pattern for the search string is then converted to a regular expression that allows for the position variation. The example Perl is:

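#!/usr/bin/perl
use strict;
use warnings;

# Usage: perl script.pl SEARCH_STRING INPUT_FILE
my $starting = shift;
my $infile   = shift;
die "Usage: $0 SEARCH_STRING INPUT_FILE\n"
  unless defined $starting && defined $infile;

# Reduce the search string to its group-level doublet form, then to a regex.
my $search = create_match( create_doublet( $starting ) );

print "Input string\n     $starting\n     $search\n\n";

open( my $in, '<', $infile ) or die "Cannot open $infile: $!\n";

while ( my $line = <$in> ) {
  chomp $line;
  my $str = create_doublet( $line );
  print "check  $line\n";

  if( $str =~ /$search/ ) {
    print "Match\n\n";
  } else {
    print "No Match\n\n";
  }

}
close( $in );

# Map each letter to its group (A-I => A, J-S => J, T-Z => t).
sub create_doublet {
  my $new = shift;
  $new =~ s/[A-I]/A/g;
  $new =~ s/[J-S]/J/g;
  $new =~ s/[T-Z]/t/g;
  # An A or J with no A/J neighbour cannot be part of a doublet, so
  # demote it to t. Lookarounds keep back-to-back cases such as tAtAt correct.
  $new =~ s/(?<![AJ])[AJ](?![AJ])/t/g;
  return $new;
}

# Convert a group-level string into a regex: runs of A or J must match
# exactly, while each run of t may vary by up to 3 characters either way.
sub create_match {
  my $groups  = shift;
  my $pattern = '';

  foreach my $fld ( $groups =~ /A+|J+|t+/g ) {
    my $char = substr( $fld, 0, 1 );
    my $cnt  = length $fld;
    if( $char eq 't' ) {
      my $low  = $cnt - 3;
      my $high = $cnt + 3;
      $low = 0 if $low < 0;
      $pattern .= $char . "{$low,$high}";
    } else {
      $pattern .= $char . "{$cnt}";
    }
  }

  # Allow up to four t characters before a leading doublet run
  # and after a trailing one.
  if( $pattern =~ /^(A|J)/ ) {
    $pattern = "t{0,4}" . $pattern;
  }
  if( $pattern =~ /(A|J)\{[0-9]+\}\s*$/ ) {
    $pattern .= "t{0,4}";
  }

  return $pattern;
}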

When this is run using an input file as follows:

AASTQWIULKONHBGDUXXIOL
TTQUIEIWOKDIJNNIUJEJHSXXZZRUEIURH
WVWAASTQWXXIUKONHBDUXXIOL
AASTQWIUKIONHBDUXXIOLWWX
AASTQWIUKONHBDUXXIOLWWX
AASVTQWVTIUKONHBDUXXWIOL
AASTQWIUKONHBDUXXIOL

The result is:

Input string
     AASTQWIUKONHBDUXXIOL
     t{0,4}A{2}J{1}t{2,8}J{3}A{3}t{0,6}A{1}J{2}t{0,4}

check  AASTQWIULKONHBGDUXXIOL
No Match

check  TTQUIEIWOKDIJNNIUJEJHSXXZZRUEIURH
No Match

check  WVWAASTQWXXIUKONHBDUXXIOL
Match

check  AASTQWIUKIONHBDUXXIOLWWX
No Match

check  AASTQWIUKONHBDUXXIOLWWX
Match

check  AASVTQWVTIUKONHBDUXXWIOL
Match

check  AASTQWIUKONHBDUXXIOL
Match
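For comparison, the other reading mentioned at the top of this post, proximity by absolute position in the string, could be sketched roughly as follows. The helper names runs and matches_by_position, and the rule that every run must start within four characters of its counterpart, are assumptions made for the illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: [offset, groups] for each run of two or more
# consecutive letters from groups a/j (A-S), mapped to A and J.
sub runs {
    my ($s) = @_;
    my @runs;
    while ( $s =~ /([A-S]{2,})/g ) {
        my $offset = $-[1];                  # start of the run
        ( my $groups = $1 ) =~ tr/A-I/A/;
        $groups =~ tr/J-S/J/;
        push @runs, [ $offset, $groups ];
    }
    return @runs;
}

# Assumed rule: same number of runs, identical group strings, and each
# run starting within $slack characters of its counterpart.
sub matches_by_position {
    my ( $template, $candidate, $slack ) = @_;
    my @t = runs($template);
    my @c = runs($candidate);
    return 0 unless @t == @c;
    for my $i ( 0 .. $#t ) {
        return 0 unless $t[$i][1] eq $c[$i][1];
        return 0 if abs( $t[$i][0] - $c[$i][0] ) > $slack;
    }
    return 1;
}

my $template = 'AASTQWIUKONHBDUXXIOL';
for my $candidate (qw(AASTQWIUKONHBDUXXIOLWWX WVWAASTQWXXIUKONHBDUXXIOL)) {
    printf "%-26s %s\n", $candidate,
        matches_by_position( $template, $candidate, 4 ) ? 'Match' : 'No Match';
}
```

Under this reading the second string, which matched above, is rejected because its second and third runs start five characters later than in the template. Which reading is wanted is for the spec to decide.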

Check the next post for an ABAP style solution.

Former Member · ‎06-25-2010

Hi David,

I took a quick look at the pattern and have a question about whether the existence of a single "a" or "j" value disqualifies a string. In item (4), the note to the side says the non-doublet characters are in group "t".

Am I reading this correctly that the existence of a single "a" or "j" (one that cannot be consumed in a doublet) disqualifies a string from being able to create a group-level representation?

Second question. I am assuming the allowable difference also includes leading or trailing sequence of "t" characters. Is that correct?

The way I am reading this requirement, it appears that most randomly generated strings would fail to be valid to generate a group-level representation, because any occurrence of an "a" or "j" bounded only by "t", start of string (^), or end of string ($) would fail to meet the requirement that the intervening characters be "t" only.

If single "a" and "j" characters do not disqualify a string from having a group level representation, a little clarification on the problem description is needed.
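To show what I mean by a single "a" or "j", here is a small Perl sketch that lists the offsets of such characters. The helper name isolated_offsets is made up, and treating start and end of string like a "t" neighbour is my assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: offsets of every letter in group a or j (A-S)
# that has no A-S letter immediately before or after it, so it cannot
# be consumed in a doublet.
sub isolated_offsets {
    my ($s) = @_;
    my @offsets;
    while ( $s =~ /(?<![A-S])[A-S](?![A-S])/g ) {
        push @offsets, $-[0];    # start offset of this match
    }
    return @offsets;
}

# Q (offset 4) and I (offset 6) each sit between two group t letters.
print join( ',', isolated_offsets('AASTQWIUKONHBDUXXIOL') ), "\n";
```

A string would be disqualified under the strict reading whenever this list is non-empty.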

Last question. A search string consisting only of characters from "t" would generate a group-level representation with no doublets. Is this also a valid search string? If yes, does the allowable difference come into play concerning the length of the strings, or is the allowable difference only considered with a doublet?

former_member181923 · ‎06-25-2010

Sorry for all the stupid typos in the original spec - I've now corrected them all so the spec is at least readable ...
