
perl versus other "scripting" languages when doing string operations

Former Member
0 Kudos

I've been told that perl is a "scripting" language like the other languages mentioned in this forum.

If that's true, can these other languages handle the following spec as well as perl can? (See spec at end of this post.)

Or is perl stronger in string operations than the other scripting languages mentioned here?

Here's the spec:

1. I give your program a twenty-letter alphabet (any twenty letter alphabet)

For example:

ABCDEFGHIJKLMNOPQRST

2. I also give your program four groups (any four groups) of letters in this alphabet:

For example:

s: A,B,C,D,E

p: F,G,H,I,J

d: K,L,M,N,O

e: P,Q,R,S,T

3. I also give your program a sequence over the twenty-letter alphabet that I gave you in Step (1) above:

For example:

ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA

4. Given this sequence, you search for pairs of adjacent letters (x,y) where x and y are from different groups (the groups defined in Step (2) above).

Also, you return the results of this search by giving me back the following two strings:

ABCD(EF)GHI(JK)LMN(OP)QRSTSRQ(PO)NML(KJ)IHG(FE)DCBA

ABCD(sp)GHI(pd)LMN(de)QRSTSRQ(ed)NML(dp)IHG(ps)DCBA

5. Note: if I give you a sequence that contains "overlapping" ordered pairs like:

...EFK...

then you ignore the second ordered pair. That is, you return:

...(EF)K
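For reference, steps 1 to 5 can be read as a single left-to-right scan. Here is a minimal Python sketch under that reading; it is not any poster's solution. The function name `mark_pairs` and the dict-of-strings layout for the groups are assumptions made for illustration, and the sketch assumes every letter of the sequence belongs to exactly one group.

```python
def mark_pairs(seq, groups):
    """Return (letters_marked, groups_marked) from one left-to-right scan."""
    # Map each letter to the name of its group, e.g. 'A' -> 's'.
    group_of = {letter: name
                for name, letters in groups.items()
                for letter in letters}
    marked, coded = [], []
    i = 0
    while i < len(seq):
        x = seq[i]
        y = seq[i + 1] if i + 1 < len(seq) else None
        if y is not None and group_of[x] != group_of[y]:
            marked.append("(" + x + y + ")")
            coded.append("(" + group_of[x] + group_of[y] + ")")
            i += 2  # skip y, so an overlapping pair such as FK in EFK is ignored
        else:
            marked.append(x)
            coded.append(x)
            i += 1
    return "".join(marked), "".join(coded)


# Hypothetical driver using the example alphabet, groups and sequence from the spec.
groups = {"s": "ABCDE", "p": "FGHIJ", "d": "KLMNO", "e": "PQRST"}
first, second = mark_pairs("ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA", groups)
print(first)
print(second)
```

Advancing the index by two after a marked pair is what implements step 5: the second letter of a marked pair can never start a new pair.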

Accepted Solutions (0)

Answers (19)


former_member181923
Active Participant
0 Kudos

closing question to get below 10

former_member181923
Active Participant
0 Kudos

At Craig's suggestion, I've set up a WIKI page for this problem here:

https://wiki.sdn.sap.com/wiki/display/EmTech/Bio-InformaticCodingProblem+1

Contributors to this thread should feel free to post their solutions as child-pages of the above page.

Contributors of multiple solutions in different languages should post each solution on a different child-page.

I will post Bill's perl and Gunter's C as solutions to Problem 2, since their programs do more than what was asked for in the original spec given in the top-post of this thread.

Former Member
0 Kudos

Here are the txt files to try against your scripts.

Would also love to see everyone save their code snippets here (https://wiki.sdn.sap.com/wiki/display/Snippets), labeled "bioinformatic" and, of course, with the language used.

former_member181923
Active Participant
0 Kudos

Hi Craig -

Thanks for posting those files.

I just want to clarify that the second and third input files are relevant to the larger programs that Bill and Gunter wrote in perl and C.

I will be explaining these programs in "Problem 2" in the EmergTech-Bioinformatic WIKI, but folks here can probably figure out how they operate on the three input files just by looking at Gunter and Bill's code.

Former Member
0 Kudos

just wanted to add the JAVASCRIPT version, but this forum doesn't let me post it. "METHOD NOT IMPLEMENTED" it says. hmmm. no idea what that means.

anton

Former Member
0 Kudos

very odd, email me the script and I'll have them double check on our dev system what is what.

Once David creates the wiki area we should be able to post there as well, although everyone could post their code samples here (https://wiki.sdn.sap.com/wiki/display/Snippets) now and label them with a tag specific to this topic.

Former Member
0 Kudos

Java version!


/**
 * 
 * @author Gregor Brett
 *
 */

public class djh
{
	public static void main(String[] args)
	{
		String	 a = "ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA";
		String[] s = {"A","B","C","D","E"};
		String[] p = {"F","G","H","I","J"};
		String[] d = {"K","L","M","N","O"};
		String[] e = {"P","Q","R","S","T"};
		String[] names = {"s","p","d","e"};
		String[][] groups = {s,p,d,e};
		String[] pairs = new String[(groups.length)*((groups[0].length*groups[0].length)*(groups.length-1))];
		String[] pairs_codes = new String[pairs.length];
		int count = 0;
		
		for(int i=0;i<groups.length;i++)
		{
			for(int n=0;n<groups[i].length;n++)
			{
			   for(int m=0;m<groups.length;m++)
			   {
			   	   if(i != m)
			   	   {
					   for(int l=0;l<groups[m].length;l++)
					   {
					      pairs[count] = groups[i][n] + groups[m][l];
					      pairs_codes[count] = names[i] + names[m];
						  count++;
					   }
			   	   }
			   }
			}
		}
		String ai = a;
		for(int i=0;i<pairs.length;i++)
		{
			a = a.replaceFirst(pairs[i], "("+pairs[i]+")");
			ai = ai.replaceFirst(pairs[i], "("+ pairs_codes[i] +")");
		}
		System.out.println(a);
		System.out.println(ai);		
	}
}

Former Member
0 Kudos

Python version!


##############################
#  @author: Gregor Brett     #
##############################

import re

a = "ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA"
s = ["A","B","C","D","E"];
p = ["F","G","H","I","J"];
d = ["K","L","M","N","O"];
e = ["P","Q","R","S","T"];
names = ["s","p","d","e"];
groups = [s,p,d,e];
pairs = [""]*(len(groups)*((len(s)**2) *(len(groups)-1))) 
pairs_codes = [""]*len(pairs)
c = 0

for i in range(0, len(groups)):
   for n in range(0, len(groups[i])):
      for m in range(0, len(groups)):
         if i != m:
            for l in range(0, len(groups[m])):
               pairs[c] = groups[i][n] + groups[m][l]
               pairs_codes[c] = names[i] + names[m]
               c = c + 1
ai = a

for p in range(0, len(pairs)):
   regex = re.compile(pairs[p])
   # Mark every occurrence of the pair in both strings so the two outputs stay aligned.
   a = regex.sub('('+pairs[p]+')', a)
   ai = regex.sub('('+pairs_codes[p]+')', ai)
   
print(a)
print(ai)

former_member181923
Active Participant
0 Kudos

I just sent Craig the files (see copy of email below.)

Note to Anton: heh heh heh ... I like your style!

Email to Craig:

Craig -

In case anyone wants to try the substring routine and the parenthesization of the nucleotide string, the attached zipfile has:

1) Gunter's c code: 20let.c;

2) Gunter's c exe: 20let.exe

3) Bill's latest perl: 20let-re.pl

4) three input files:

val1a.txt

val2a.txt

val3b.txt

5) output file: fileout.txt

On to the WIKI-page!!!!

Thanks very much again. You're being very kind.

Best

djh

Former Member
0 Kudos

here's a solution using the exact same algorithm as the one i posted earlier. only the language used this time is a bit 'chattier'

(or my programming skill in this language is).

looking forward to see if anyone guesses the language used.


*&---------------------------------------------------------------------*
*& Report  ZTW_REGEX1
*&
*&---------------------------------------------------------------------*
*& created for djh challenge; ACW210308
*&
*&---------------------------------------------------------------------*

REPORT  ZTW_REGEX1.

data: l_target    type string,
      l_targes    type string,
      lt_group    type table of string,
      ll_group    type string,
      ll_grouq    type string,
      lt_groupid  type table of string,
      ll_groupid  type string,
      ll_grouqid  type string,
      l_ind       type i,
      l_ine       type i,
      l_pattern   type string,
      l_replace   type string.

l_target = 'ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA'.
l_targes = 'ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA'.

append 'ABCDE' to lt_group. append 's' to lt_groupid.
append 'FGHIJ' to lt_group. append 'p' to lt_groupid.
append 'KLMNO' to lt_group. append 'd' to lt_groupid.
append 'PQRST' to lt_group. append 'e' to lt_groupid.

loop at lt_group into ll_group.
  l_ind = sy-tabix.
  loop at lt_group into ll_grouq.
    l_ine = sy-tabix.
    if l_ind <> l_ine.
      read table lt_group index l_ind into ll_group.
      read table lt_group index l_ine into ll_grouq.
      read table lt_groupid index l_ind into ll_groupid.
      read table lt_groupid index l_ine into ll_grouqid.
      concatenate '([' ll_group '][' ll_grouq '])' into l_pattern.
      replace regex l_pattern in l_target with '($1)'.
      concatenate '(' ll_groupid ll_grouqid ')' into l_replace.
      replace regex l_pattern in l_targes with l_replace.
    endif.
  endloop.
endloop.

write: / l_target.
write: / l_targes.

anton

Former Member
0 Kudos

just found out that this forum software removes < and > even if it's within a code section.

well, so it's up to you, dear reader, to find the missing <> in the code fragment above.

anton

former_member181923
Active Participant
0 Kudos

In Bill's perl program posted above, he put a comment in the "substrings" routine to indicate that he had a faster version in mind.

Here is his "regex" recoding of the "substrings" routine. He's pretty sure it will run faster than the original.


sub substrings {
    my ($x1, $x2) = @_;
    printf("\nlengths %d - %d : \n", $x1, $x2);
    pos($C) = 0;
    while ($C =~ /[spdt]/g) {
	pos($C)-1+$x1 < length $C or last;
	my $j = substr($C, pos($C)-1+$x1, $x2-$x1);
	while ($j =~ /[spdt]/g) {

	    my $a = my $c = substr($C, pos($C)-1, pos($j)+$x1);
	    my $b = substr($B, pos($C)-1, pos($j)+$x1);
	    $b =~ tr/()//d;
	    $c =~ tr/a-z//cd;
	    print "$a|$b|$c\n";
	}
    }
}
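For readers who don't read Perl, here is a rough Python sketch of what the `substrings` routine appears to do. It follows the loop-based version in Bill's full program rather than the regex recoding above. The Python names (`line_b`, `line_c`, the returned list) are assumptions for illustration; in the Perl, the two lines live in the globals `$B` and `$C` and the rows are printed directly.

```python
def substrings(line_b, line_c, x1, x2):
    """Collect rows 'a|b|c' for windows that start and end on a group letter.

    line_b is the first output line (pairs such as "(HP)"), line_c the second
    (group codes such as "(dp)"); both have the same length and layout.
    """
    def is_group_letter(ch):
        # Lowercase ASCII letters only occur inside marked pairs of line_c.
        return "a" <= ch <= "z"

    rows = []
    for i in range(len(line_c)):
        if not is_group_letter(line_c[i]):
            continue
        # End positions j with i+x1 <= j < i+x2, clipped to the line length.
        for j in range(i + x1, min(i + x2, len(line_c))):
            if not is_group_letter(line_c[j]):
                continue
            a = line_c[i:j + 1]
            b = line_b[i:j + 1].replace("(", "").replace(")", "")
            c = "".join(ch for ch in a if is_group_letter(ch))
            rows.append(a + "|" + b + "|" + c)
    return rows


# Tiny made-up input, just to show the shape of the rows.
for row in substrings("R(HP)V(EK)", "R(dp)V(dt)", 3, 6):
    print(row)
```

Returning a list instead of printing makes the routine easier to compare against the Perl and C output line by line.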

Former Member
0 Kudos

a quick PHP solution:


<?
// djh challenge

$target = $targes = 'ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA';
$group[1] = 'ABCDE'; $groupid[1] = "s";
$group[2] = 'FGHJI'; $groupid[2] = "p";
$group[3] = 'KLMNO'; $groupid[3] = "d";
$group[4] = 'PQRST'; $groupid[4] = "e";


for($i=1; $i < count($group)+1; $i++) {
  for($j=1; $j < count($group)+1; $j++) {
    if ($i <> $j) {
		$target = preg_replace('/(['.$group[$i].']['.$group[$j].'])/', '($1)', $target);
		$targes = preg_replace('/(['.$group[$i].']['.$group[$j].'])/', '('.$groupid[$i].$groupid[$j].')', $targes);		
    }
  }
}
echo $target. "\n" . $targes;

//fulfills 5. & yields
//ABCD(EF)GHI(JK)LMN(OP)QRSTSRQ(PO)NML(KJ)IHG(FE)DCBA
//QED.
?>

regards, anton
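One case that may be worth checking with any solution that runs one replace per ordered pair of groups: the order of the replaces can decide which of two overlapping pairs gets marked. The Python sketch below is a hedged comparison, not a port of the PHP. The function names are made up for illustration, and it relies on dicts keeping insertion order (Python 3.7+). If rule 5 is read as strictly left to right, a sequence like KAF should come back as (KA)F.

```python
import re

groups = {"s": "ABCDE", "p": "FGHIJ", "d": "KLMNO", "e": "PQRST"}


def by_group_pairs(seq):
    # Same idea as the nested loops above: one global replace
    # per ordered pair of different groups, in group order.
    for gi, a in groups.items():
        for gj, b in groups.items():
            if gi != gj:
                seq = re.sub("([" + a + "][" + b + "])", r"(\1)", seq)
    return seq


def single_pass(seq):
    # One alternation covering every ordered pair of different groups.
    # re.sub scans left to right and does not revisit matched characters,
    # so the leftmost of two overlapping pairs wins.
    alternatives = ["[" + a + "][" + b + "]"
                    for gi, a in groups.items()
                    for gj, b in groups.items()
                    if gi != gj]
    return re.sub("(" + "|".join(alternatives) + ")", r"(\1)", seq)


sample = "ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA"
print(by_group_pairs(sample))
print(single_pass(sample))
print(by_group_pairs("KAF"), single_pass("KAF"))
```

On the sample sequence from the spec the two functions agree, because none of its cross-group pairs overlap. They differ only on inputs such as KAF, where the s/p replace runs before the d/s replace.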

former_member181923
Active Participant
0 Kudos

Anton -

Thanks for the contribution - it makes the thread that much more interesting, and it was interesting already.

If we can get a Python example and a LISP or SCHEME example, I'm going to suggest that we all try to do a "cross-walk" of all the different programs, with clear explanations of exactly how construct A in program X does the job of construct B in program Y, construct C in program Z, etc.

Sometimes, it's easier to learn one language by learning several at the same time.

Best regards

djh

Former Member
0 Kudos

very nice, a lot neater than mine.

@David can you email me your text files? I'll make them available here so that everyone can work with the same data set.

former_member181923
Active Participant
0 Kudos

Craig -


At the risk of ticking you off, I'm going to make a more "comprehensive" 
suggestion regarding file-sharing.

In 2005, I took my own web-server off line from the commercial ISP where
it was housed. (It was costing me about $400 per month to keep it running
and accessible with reasonable response time.)

Although it is an old SUN RaQ500 "appliance server" that is no longer
officially supported by Sun, it is a perfectly serviceable machine that
still runs very nicely under Linux with several PHP bulletin boards fully
configured on it.

Also, I know of two people that would probably "admin" it for free.

So how about if I crate it up and send it to you to be mounted somewhere 
in SAP land, with a private IP that would be given to those in this 
little collaboration group that seems to be starting up nicely here.

If you/SAP were willing to do this, then it would be very easy for me to 
continue providing interesting scripting problems that would teach anyone
interested a lot about bioinformatics.

Plus, the bulletin board capability would be more convenient and take some
load off this forum (we could post back here when we have something of 
more-than-usual interest to report.)

Do I have a hidden agenda here?

Absolutely.

As I said many times before, I'd like to drag SAP kicking and screaming into
the world of bioinformatics because eventually, someone somewhere is going to
realize that the way to do bioinformatics is to structure it as an SCM problem
(what makes what, where do you get it from, what else uses it, etc. etc. etc.??)
And that someone somewhere might as well be SAP rather than Oracle.

So what I'm suggesting would be an absolutely free way to set up some
infrastructure that would allow forward motion in this direction, 
depending, of course, on the willingness of folks here like Ethan and Alvaro
etc to continue contributing code.

See? I told you I was going to tick you off.

Sorry! I figured it was worth a shot.

Former Member
0 Kudos

Uh - NO.

There's no reason for it: we have an entire Wiki Code Gallery, an entire Wiki for collaborating, and the forums already, so you offer us nothing new. We've had this discussion in the past about using the Wiki (you've still not taken advantage of that).

So if you want to send me the data files, I'll attach them to the forum; if not, then fine, but we're certainly not going to host a server just for this when we have all the capabilities already.

So you want to get the community more interested? Then do your part and put the info into the existing tools. If the community is interested enough then SAP will begin to notice; otherwise you're not going to effect much change.

former_member181923
Active Participant
0 Kudos

See - I told you it would tick you off !!!!

Seriously ... fair enough.

I'll send you the files tonight and then I'll start putting some stuff out on the "icky-wicky" (just kidding, just kidding ...)

Former Member
0 Kudos

believe me, that did not "tick me off"; very little in this topic area could do that

former_member181923
Active Participant
0 Kudos

Here are some comments from Bill on the situation as he sees it:

I found the ruby code and ran it on my Linux box. It takes no arguments and prints out:

mannb:dh $ ruby 20let.ruby

ABCD(EF)GHI(JK)LMN(OP)QRSTSRQ(PO)NML(KJ)IHG(FE)DCBA

Since I don't know ruby, and it doesn't do the same things, it's hard to evaluate. Certainly I'd need to look at a ruby manual to understand it. The regular expression is used to locate the stuff in (), but as Ethan said, it doesn't try to find the substrings, at least not yet.

I translated the C code to perl in 2-3 hours. I didn't try to figure out the best possible algorithm, and I tried to make the programs parallel so they would be easy to compare. The C program is not optimized for speed, space, or style. My perl program is fairly simple if you can read Perl regular expressions and understand the options of tr/// I'm using.

Python uses what is basically the same regular expression subroutine package as perl. Ruby seems different.

We used perl because I like it and it was flexible and speedy enough for most things. If not, C's my choice.

esjewett
Active Contributor
0 Kudos

David,

Well, it kept niggling, so over lunch I updated it to be a little prettier, tested and fixed it with multiple characters, and added an example of how to pass it arguments. (Edited to add that it also returns both strings requested.) I think I'm satisfied with it now


def david_halitsky_challenge(seq = "ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA", 
                                            groups = { 's' => %w(A B C D E),
                                             'p' => %w(F G H I J),
                                             'd' => %w(K L M N O),
                                             'e' => %w(P Q R S T) })
 
# Arrays of letters *not* in each group

  gls = groups.keys
 
  notgroups = { gls[0] => groups[gls[1]]+groups[gls[2]]+groups[gls[3]],
                gls[1] => groups[gls[0]]+groups[gls[2]]+groups[gls[3]],
                gls[2] => groups[gls[1]]+groups[gls[0]]+groups[gls[3]],
                gls[3] => groups[gls[1]]+groups[gls[2]]+groups[gls[0]] }
 
# One regex per group
 
  regexes = Hash.new()
 
# Generate regex string of form ((K|L|M|N|O)(F|G|H|I|J|A|B|C|D|E|P|Q|R|S|T))
 
  groups.keys.each do |g|
    regexes[g] = '((' + groups[g].join('|') + ')(' + notgroups[g].join('|') + '))'
  end
 
# Build the full regex.
 
  big_ol_regex = ''
 
  regexes.keys.each do |r|
    big_ol_regex += regexes[r].to_s.reverse.chomp(r.to_s.reverse).reverse + '|'
  end
  
  big_ol_regex.chomp!('|')
  big_ol_regex += '+?'
  
# Substitute using the first backward match to return the first result.
  puts seq.gsub(Regexp.new(big_ol_regex), '(\0)')
  
# Replace letters with group names to build the second result.
  seq_replaced = seq.gsub(Regexp.new(big_ol_regex)) do |s|
    groups.keys.each do |k|
      s.gsub!(Regexp.new('(' + groups[k].join('|') + ')+?'), k.to_s)
    end
    '(' + s + ')'
  end
  
  puts seq_replaced
end

# Method david_halitsky_challenge expects a sequence of "letters" and a hash
# with 4 keys, each pointing to an array of "letters" found in the sequence.  These
# arrays are the "groups".  The method assumes that the groups do not overlap
# either at the full-letter level or at the sub-letter level for multi-character letters.
#
# david_halitsky_challenge("ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA", 
#                                          { 's' => %w(A B C D E),
#                                            'p' => %w(F G H I J),
#                                            'd' => %w(K L M N O),
#                                            'e' => %w(P Q R S T) })

david_halitsky_challenge()

# Multi-character "letter" test.
david_halitsky_challenge("zAzBzCzDzEzFzGzHzIzJzKzLzMzNzOzPzQzRzSzTzSzRzQzPzOzNzMzLzKzJzIzHzGzFzEzDzCzBzA", 
                                          { 'Ys' => %w(zA zB zC zD zE),
                                            'Yp' => %w(zF zG zH zI zJ),
                                            'Yd' => %w(zK zL zM zN zO),
                                            'Ye' => %w(zP zQ zR zS zT) })

Edited by: Ethan Jewett on Mar 19, 2008 8:27 PM

former_member181923
Active Participant
0 Kudos

Hey Ethan -

Glad you "couldn't resist".

No matter what the technical merits of the regex approach are or are not, I gotta say you've got style and class. "Big_ol_regex" indeed!

I've asked Bill Mann to comment on your code, and if he has the time, I'll post what he has to say.

And I'm sure some of the more savvy scriptors around here will have something to say about it as well.

Thanks again for posting your approach.

Dave

esjewett
Active Contributor
0 Kudos

Well David, I couldn't pass this one up, and so my inaugural post in the forums is a snippet of Ruby code

This is set up to run your initial example, and doesn't return the second string you ask for, but I'm out of time. I think it'll probably work on the amino acid examples (multi-character 'letters') that you give, but I haven't tested it. It takes a sequence in the alphabet, an array of group names, and four groups of letters in a hash.

Most importantly, it uses a regular expression to do the dirty work. I felt some obligation to back up all my Twitter talk over the last couple of days and though I'm no expert, this is what I came up with.

Enjoy.


def david_halitsky_challenge(seq = "ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA", 
                              gls = %w(s p d e), 
                              groups = { 's' => %w(A B C D E),
                                             'p' => %w(F G H I J),
                                             'd' => %w(K L M N O),
                                             'e' => %w(P Q R S T) })

# Arrays of letters *not* in each group

  notgroups = { gls[0] => groups[gls[1]]+groups[gls[2]]+groups[gls[3]],
                gls[1] => groups[gls[0]]+groups[gls[2]]+groups[gls[3]],
                gls[2] => groups[gls[1]]+groups[gls[0]]+groups[gls[3]],
                gls[3] => groups[gls[1]]+groups[gls[2]]+groups[gls[0]] }

# One regex per group

  regexes = Hash.new()

# Generate regex string of form d((K|L|M|N|O)(F|G|H|I|J|A|B|C|D|E|P|Q|R|S|T))

  gls.each do |g|
    regexes[g] = '(('

    groups[g].each do |l|
      regexes[g] += l + '|'
    end

    regexes[g].chomp!('|')
    regexes[g] += ')('

    notgroups[g].each do |l|
      regexes[g] += l + '|'
    end

    regexes[g].chomp!('|')
    regexes[g] += '))'
  end

# Build the full regex.

  big_ol_regex = ''

  regexes.each do |r|
    big_ol_regex += r.to_s.reverse.chop.reverse + '|'
  end
  
  big_ol_regex.chomp!('|')
  big_ol_regex += '+?'
  
# Substitute using the first backward match.
  puts seq.gsub(Regexp.new(big_ol_regex), '(\0)')
end

david_halitsky_challenge()

Edited by: Ethan Jewett on Mar 19, 2008 2:22 AM

former_member181923
Active Participant
0 Kudos

Dan/Craig -

Yeah - Alvaro pretty much has it correct: speed and convenience. But not convenience in the usual sense.

Let's look at speed first.

Imagine you had to do the algorithm on hundreds of millions of strings, many much longer than the strings in inputs 1 and 2. That's where the speed comes in.

But here's where the "convenience" comes in (in my sense of the word "convenience"). Suppose that in terms of speed, it looks like this (from fastest to slowest):

C/C++

perl

PHP

But suppose your best algorithm-creator actually thinks best "in PHP".

Then for that person, it's more "convenient" to frame a solution in PHP and then let others "translate" that solution into other languages that generate faster runtimes.

Of course, there are some who would say it's better to "think-up" the algorithm in language-neutral terms. But in my experience, that's not really the way people work.

Because the nature of each language interacts with the algorithm-creation process in very subtle ways.

Former Member
0 Kudos

I'd probably move PHP much further down the list. I'll finish up my PHP code and post when I get a chance.

former_member181923
Active Participant
0 Kudos

Craig -

I guess I should have typed:

...

C/C++

...

perl

...

PHP

...

I didn't mean to imply that PHP was "third-fastest" - I just didn't know where to fit Ruby, Python, etc, in the ranking

But I'm glad you agree that PHP is toward the bottom.

Dave

former_member181923
Active Participant
0 Kudos

Here's a perl program that Bill Mann wrote to do the same thing as the C program in the last post.

He says it may not be the fastest possible perl (but that's just him being modest!).


#!/usr/bin/perl -w

# perl version of 20let.c5

sub usage {
    print("\nusage:20let protein-file nucleotide-file pairs-include-file\n\n");
    print("marks amino-acid-pairs from different groups in protein-file\n");
    print("iff they are in the include-file\n");
}

{
    @ARGV != 3 and &usage, exit 1;

#----------------define the groups 
    for (qw(I M V A G)) {
	$G{$_} = 's';
    }
    for (qw(F L P W)) {
	$G{$_} = 'p';
    }
    for (qw(H Q D E)) {
	$G{$_} = 'd';
    }
    for (qw(S T Y N C K R)) {
	$G{$_} = 't';
    }

#----------------the 4 bases
    $bases = '[acgt]';

#---------------- read include-file       (3rd argument)
    open(I, "<$ARGV[2]") or die "can't open include-file $ARGV[2]\n";
    while (defined($_ = <I>)) {
	/^(..) ($bases{6,6})$/io and $E{$2} = $1;
    }

#------------------read amino-acid file   (1st argument)
    open(I, "<$ARGV[0]") or die "can't open amino-acid file $ARGV[0]\n";
    while (defined($_ = <I>)) {
	tr/IMVAGFLPWHQDESTYNCKR//cd;	# delete anything else
	$P .= $_;
    }
    $p = length $P;

#------------------read nucleotide file   (2nd argument)
    open(I, "<$ARGV[1]") or die "can't open nucleotide file $ARGV[1]\n";
    while (defined($_ = <I>)) {
	tr/acgt//cd;			# delete anything else
	$N .= $_;
    }
    length $N >= $p * 3 or
	die "amino-acid file doesn't match nucleotide file\n";

#------------three output lines------------------
    for ($i=0; $i < $p; ++$i) {
	if ($E{substr($N, $i*3, 6)}) {
	    my $a = $G{substr($P, $i, 1)};
	    my $b = $G{substr($P, $i+1, 1)};
	    if (!defined $a || !defined $b) {
		1;
	    }
	    if ($a ne $b)
	{
	    $B .= '(' . substr($P, $i, 2) . ')';
	    $C .= '(' . $G{substr($P, $i, 1)} . $G{substr($P, $i+1, 1)} . ')';
	    $D .= '(' . substr($N, $i*3, 6) . ')';
	    ++$i;
	    next;
	}}

	$B .= substr($P, $i, 1);
	$C .= substr($P, $i, 1);
	$D .= substr($N, $i*3, 3);
    }
    print $B, "\n";
    print $C, "\n";
    print $D, "\n";

#--------------substrings------------

    substrings(20,29);
    substrings(30,39);
    substrings(40,49);
    substrings(50,59);
    substrings(60,69);
}

sub substrings {
    my ($x1, $x2) = @_;
    printf("\nlengths %d - %d : \n", $x1, $x2);
    for ($i = 0; $i < length $C; ++$i) { # using m// and pos() might be faster
	substr($C, $i, 1) =~ /[a-z]/o or next;
	for ($j = $i+$x1; $j < $i+$x2 && $j < length $C; ++$j) {
	    substr($C, $j, 1) =~ /[a-z]/o or next;

	    my $a = my $c = substr($C, $i, $j-$i+1);
	    my $b = substr($B, $i, $j-$i+1);
	    $b =~ tr/()//d;
	    $c =~ tr/a-z//cd;
	    print "$a|$b|$c\n";
	}
    }
}
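The core of the program above is the "three output lines" loop: an adjacent amino-acid pair is only marked if its two letters are in different groups and its six nucleotides appear in the include-file. Here is a minimal Python sketch of that loop as I read it. The names `mark_with_codons` and `include` are assumptions for illustration, the file reading is left out, and `include` stands in for the `%E` hash built from lines like `HP cacccg`.

```python
GROUPS = {"s": "IMVAG", "p": "FLPW", "d": "HQDE", "t": "STYNCKR"}
GROUP_OF = {aa: name for name, letters in GROUPS.items() for aa in letters}


def mark_with_codons(protein, nucleotides, include):
    """Return the three output lines (amino acids, group codes, nucleotides).

    include maps a six-base string to its amino-acid pair, e.g. 'cacccg' -> 'HP'.
    Assumes protein holds only the 20 grouped letters and that nucleotides
    carries three bases per amino acid.
    """
    line_b, line_c, line_d = [], [], []
    i = 0
    while i < len(protein):
        six = nucleotides[3 * i:3 * i + 6]
        nxt = protein[i + 1] if i + 1 < len(protein) else None
        if (six in include and nxt is not None
                and GROUP_OF[protein[i]] != GROUP_OF[nxt]):
            line_b.append("(" + protein[i] + nxt + ")")
            line_c.append("(" + GROUP_OF[protein[i]] + GROUP_OF[nxt] + ")")
            line_d.append("(" + six + ")")
            i += 2  # the second letter of a marked pair cannot start a new pair
        else:
            line_b.append(protein[i])
            line_c.append(protein[i])
            line_d.append(nucleotides[3 * i:3 * i + 3])
            i += 1
    return "".join(line_b), "".join(line_c), "".join(line_d)


# Four residues taken from the second row of the sample inputs in this thread.
for line in mark_with_codons("RHPV", "cgtcacccggtt", {"cacccg": "HP"}):
    print(line)
```

In this small example RH is also a cross-group pair, but its six bases (cgtcac) are not in the include set, so only HP is marked.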

former_member181923
Active Participant
0 Kudos

OK - here is the final stuff on the "C" side.

To execute the program, the command line is:


20let.exe file1.txt file2.txt file3.txt > fileout.txt

Below, I've provided:

a) source code 20let.c

b) sample input file1.txt

c) sample input file2.txt

d) sample input file3.txt

e) output fileout.txt generated from these input files.

As soon as Bill finishes the perl version of the source code, I'll post that also.


***************
source code of 20let.c
***************
// 20let.c5

#include <stdio.h>
#include <stdlib.h>

int T[333],A[99999],G[333],B[99999],C[99999],N[299999],P[99999];
int n1,n2,f,p,x1,x2,n,m,a,b,c,i,j,k,x,y,z;
int E[233][233];

FILE *file;
int substrings(int x1,int x2);


int main(int argc, char*argv[]) {
    if(argc<4){   // three file arguments are required: argv[1]..argv[3]
        printf("\nusage:20let protein-file nucleotide-file pairs-include-file\n\n");
        printf("marks amino-acid-pairs from different groups in protein-file\n");
        printf("iff they are in the include-file\n");
        exit(1);
    }


//----------------define the groups        G['I'] = 's', e.g.

    x='s'; G['I']=x;G['M']=x;G['V']=x;G['A']=x;G['G']=x;
    x='p'; G['F']=x;G['L']=x;G['P']=x;G['W']=x;G['W']=x;
    x='d'; G['H']=x;G['Q']=x;G['D']=x;G['E']=x;G['E']=x;
    x='t'; G['S']=x;G['T']=x;G['Y']=x;G['N']=x;G['C']=x;G['K']=x;G['R']=x;

//----------------the 4 bases              T['a'] = 0 thru 3
    for(x=0;x<222;x++)
        T[x]=-999;
    T['a']=0;T['c']=1;T['g']=2;T['t']=3;
    T['A']=0;T['C']=1;T['G']=2;T['T']=3;

/*
  for(i=65;i<70;i++)G[i]='s';
  for(i=70;i<75;i++)G[i]='p';
  for(i=75;i<80;i++)G[i]='d';
  for(i=80;i<85;i++)G[i]='t';
*/




//---------------- read include-file   file3 xxxyyy pairs E[x][y] of interest

    f=0;
    for(x=0;x<222;x++)
        for(y=0;y<222;y++)
            E[x][y]=0;

    if((file=fopen(argv[3],"rb"))==NULL){
        printf("\ncan't open include-file %s\n",argv[3]);exit(1);
    }

mq1: if(feof(file))
        goto mq3;
    x=fgetc(file);y=fgetc(file);x=fgetc(file);
    x=T[fgetc(file)]*16+T[fgetc(file)]*4+T[fgetc(file)];
    y=T[fgetc(file)]*16+T[fgetc(file)]*4+T[fgetc(file)];
    if(x<64 && x>=0 && y<64 && y>=0){
        E[x][y]=1;
        f++;
    }
mq2: if(feof(file))
        goto mq3;
    a=fgetc(file);
    if(a!=10)
        goto mq2;
    goto mq1;

mq3: fclose(file);


//------------------read amino-acid file    file1 == P array
    if((file=fopen(argv[1],"rb"))==NULL){
        printf("\ncan't open file %s\n",argv[1]);exit(1);}
    p=0;
m1p: if(feof(file))
        goto m2p;
    p++;
    P[p]=fgetc(file);
    if(G[P[p]]==0)
        p--;
    goto m1p;

m2p:;
    fclose(file);


//------------------read nucleotide file    file2 == N array
    if((file=fopen(argv[2],"rb"))==NULL){
        printf("\ncan't open file %s\n",argv[2]);exit(1);
    }
    n=0;
m1n: if(feof(file))
        goto m2n;
    n++;
    N[n]=fgetc(file);
    if(N[n]!='a' && N[n]!='c' && N[n]!='g' && N[n]!='t')
        n--;
    goto m1n;
m2n:;
    fclose(file);


//for(i=1;i<=p;i++)printf("%c",P[i]);printf("\n");
//for(i=1;i<=n;i++)printf("%c",N[i]);printf("\n");
//printf("%i include-pairs  %i nucleotides  %i proteins\n",f,n,p);


//------------1st line------------------       B[i] = result
    m=0;
    for(i=1;i<=p;i++){
        n1=T[N[i*3-2]]*16+T[N[i*3-1]]*4+T[N[i*3]];
        n2=T[N[i*3+1]]*16+T[N[i*3+2]]*4+T[N[i*3+3]];

//printf("\ni=%i p=%i n1=%i n2=%i\n",i,p,n1,n2);

        if(E[n1][n2]<1 || G[P[i]]==G[P[i+1]] /* || i==n */){
            printf("%c",P[i]);
            m++;
            B[m]=P[i];
            goto m3;
        }
        printf("(%c%c)",P[i],P[i+1]);
        i++;
        m++;
        B[m]='(';
        m++;
        B[m]=P[i-1];
        m++;
        B[m]=P[i];
        m++;
        B[m]=')';
//printf("(%c)%c",G[A[i]],G[A[i+1]]);i++;
    m3:;
    }
    printf("\n");


//------------2nd line------------------       C[i] = result
    m=0;
    for(i=1;i<=p;i++){
        n1=T[N[i*3-2]]*16+T[N[i*3-1]]*4+T[N[i*3]];
        n2=T[N[i*3+1]]*16+T[N[i*3+2]]*4+T[N[i*3+3]];

        if(E[n1][n2]<1 || G[P[i]]==G[P[i+1]] /* || i==n */){
            printf("%c",P[i]);
            m++;
            C[m]=P[i];
            goto m4;
        }
        printf("(%c%c)",G[P[i]],G[P[i+1]]);
        i++;
        m++;
        C[m]='(';
        m++;
        C[m]=G[P[i-1]];
        m++;
        C[m]=G[P[i]];
        m++;
        C[m]=')';
//printf("(%c)%c",G[A[i]],G[A[i+1]]);i++;
    m4:;
    }
    printf("\n");



//for(i=1;i<=m;i++)printf("%c",B[i]);printf("\n");



//------------3rd line------------------         printf only
    m=0;
    for(i=1;i<=p;i++){
        n1=T[N[i*3-2]]*16+T[N[i*3-1]]*4+T[N[i*3]];
        n2=T[N[i*3+1]]*16+T[N[i*3+2]]*4+T[N[i*3+3]];
        if(E[n1][n2]<1 || G[P[i]]==G[P[i+1]] /* || i==n */){
            printf("%c%c%c",N[i*3-2],N[i*3-1],N[i*3]);
            goto m33;
        }
        printf("(%c%c%c%c%c%c)",N[i*3-2],N[i*3-1],N[i*3],N[i*3+1],N[i*3+2],N[i*3+3]);
        i++;
    m33:;
    }
    printf("\n");




//--------------substrings------------

    substrings(20,29);
    substrings(30,39);
    substrings(40,49);
    substrings(50,59);
    substrings(60,69);

    return 0;
}


int substrings(int x1,int x2)
{

    printf("\n");
    printf("lengths %i - %i : \n",x1,x2);
    for(i=1; i<p; i++)
        for (j=i+x1; j<i+x2; j++) {
            if (C[i]>95 && C[j]>95) {   // if lc letter in line2
                for(x=i;x<=j;x++)
                    printf("%c",C[x]);
                printf("|");
                for(x=i;x<=j;x++)
                    if(B[x]>44)         // if not () in line 1
                        printf("%c",B[x]);

                printf("|");
                for(x=i;x<=j;x++)
                    if(C[x]>95)         // if lc letter line2
                        printf("%c",C[x]);

                printf("\n");}
        }
    return 0;   // substrings is declared int, so return a value
}

******************
input file1.txt
******************
MKKHTDQPIADVQGSPDTRH
IAIDRVGIKAIRHPVLVADK
DGGSQHTVAQFNMYVNLPHN
FKGTHMSRFVEILNSHEREI
SVESFEEILRSMVSRLESDS
GHIEMTFPYFVNKSAPISGV
KSLLDYEVTFIGEIKHGDQY
GFTMKVIVPVTSLCPCSKKI
SDYGAHNQRSHVTISVHTNS
FVWIEDVIRIAEEQASCELF
GLLKRPDEKYVTEKAYNNPK
FVEDIVRDVAEILNHDDRID
AYVVESEBFESIHNHSAYAL
IERD

***********************
input file2.txt
**********************
atgaaaaaacatactgatcaacctatcgctgatgtgcagggctcaccggataccagacat
atcgcaattgacagagtcggaatcaaagcgattcgtcacccggttctggtcgccgataag
gatggtggttcccagcataccgtggcgcaatttaatatgtacgtcaatctgccacataat
ttcaaagggacgcatatgtcccgttttgtggagatactaaatagccacgaacgtgaaatt
tcggttgaatcatttgaagaaattttgcgctccatggtcagcaggctggaatcagattcc
ggccatattgaaatgacttttccctacttcgtcaataaatcagcccctatctcaggtgta
aaaagcttgctggattatgaggtaacctttatcggcgaaattaaacatggcgatcaatat
gggtttaccatgaaggtgatcgttcctgttaccagcctgtgcccctgctccaagaaaata
tccgattacggtgcgcataaccagcgttcacacgtcaccatttctgtacacactaacagc
ttcgtctggattgaggacgttatcagaattgcggaagaacaggcctcatgcgaactgttc
ggtctgctgaaacggccggatgaaaaatatgtcacagaaaaggcctataacaatccgaaa
tttgtcgaagatatcgtccgtgatgtcgccgaaatacttaatcatgatgaccggatagat
gcctatgttgttgaatcagaaaactttgaatccatacataatcactctgcatacgcactg
atagagcgcgac 

******************
input file3.txt
******************
FA tttgcc
FA ttcgcc
FA tttgct
FA ttcgct
LK ttaaaa
LK ttgaaa
LK ttaaag
LK ttgaag
LS ctgctc
LS ctgctt
LS ctactc
LS ctactt
LT ctcacc
LT ctcact
LT cttacc
LT cttact
LY ctctac
LY ctctat
LY ctttac
LY ctttat
LG ctcggc
LG ctcggt
LG cttggc
LG cttggt
IP attccc
IP attcct
IP atcccc
IP atccct
IP attcca
IP attccg
IP atccca
IP atcccg
ML atgctc
ML atgctt
ML atgctc
ML atgctt
VL gtgctg
VL gtgcta
VL gtactg
VL gtacta
VS gtgtcc
VS gtatct
VS gtgtcc
VS gtatct
VT gtcacc
VT gtcact
VT gttacc
VT gttact
VS gtcagc
VS gtcagt
VS gttagc
VS gttagt
SL tcgctg
SL tcgcta
SL tcactg
SL tcacta
SP tctcca
SP tctccg
SP tcccca
SP tccccg
PV ccggtg
PV ccggta
PV ccagtg
PV ccagta
PG cccggc
PG cccggt
PG cctggc
PG cctggt
TL acgctg
TL acgcta
TL acactg
TL acacta
TP acgccg
TP acgcca
TP acaccg
TP acacca
AL gcttta
AL gctttg
AL gcctta
AL gccttg
AP gcgccg
AP gcgcca
AP gcaccg
AP gcacca
AP gctcca
AP gctccg
AP gcccca
AP gccccg
AN gctaat
AN gctaac
AN gccaat
AN gccaac
AS gccagc
AS gccagt
AS gctagc
AS gctagt
YP tatccg
YP tatcca
YP tacccg
YP taccca
HP catccg
HP catcca
HP cacccg
HP caccca
QR cagcga
QR cagcgg
QR caacga
QR caacgg
DL gatttg
DL gattta
DL gacttg
DL gactta
EN gaaaat
EN gaaaac
EN gagaat
EN gagaac
EK gaaaaa
EK gaaaag
EK gagaaa
EK gagaag
ER gagcga
ER gagcgg
ER gaacga
ER gaacgg
WR tggcga
WR tggcgg
RV cgggtg
RV cgggta
RV cgagtg
RV cgagta
RW cggtgg
RW cgatgg
SG agtgga
SG agtggg
SG agcgga
SG agcggg
GF ggtttt
GF ggtttc
GF ggcttt
GF ggcttc
GL gggctg
GL gggcta
GL ggactg
GL ggacta
GY gggtat
GY gggtac
GY ggatat
GY ggatac
GY ggttat
GY ggttac
GY ggctat
GY ggctac
GK ggaaaa
GK ggaaag
GK gggaaa
GK gggaag
GK ggcaag
GK ggcaaa
GK ggtaag
GK ggtaaa
GW ggctgg
GW ggttgg
GR gggcgg
GR gggcga
GR ggacgg
GR ggacga
GS ggcagc
GS ggcagt
GS ggtagc
GS ggtagt

***********************
output fileout.txt
**********************
MKKHTDQPIADVQGSPDTRHIAIDRVGIKAIR(HP)VLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSM(VS)RLESDSGHIEMTFPYFVNKSAPISGVKSLLDYEVTFIGEIKHGDQYGFTMKVIVP(VT)SLCPCSKKISDYGAHNQRSH(VT)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(EK)YVT(EK)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(EF)ESIHNHSAYALIERD
MKKHTDQPIADVQGSPDTRHIAIDRVGIKAIR(dp)VLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSM(st)RLESDSGHIEMTFPYFVNKSAPISGVKSLLDYEVTFIGEIKHGDQYGFTMKVIVP(st)SLCPCSKKISDYGAHNQRSH(st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt)YVT(dt)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(dp)ESIHNHSAYALIERD

atgaaaaaacatactgatcaacctatcgctgatgtgcagggctcaccggataccagacatatcgcaattgacagagtcggaatcaaagcgattcgt(cacccg)gttctggtcgccgataaggatggtggttcccagcataccgtggcgcaatttaatatgtacgtcaatctgccacataatttcaaagggacgcatatgtcccgttttgtggagatactaaatagccacgaacgtgaaatttcggttgaatcatttgaagaaattttgcgctccatg(gtcagc)aggctggaatcagattccggccatattgaaatgacttttccctacttcgtcaataaatcagcccctatctcaggtgtaaaaagcttgctggattatgaggtaacctttatcggcgaaattaaacatggcgatcaatatgggtttaccatgaaggtgatcgttcct(gttacc)agcctgtgcccctgctccaagaaaatatccgattacggtgcgcataaccagcgttcacac(gtcacc)atttctgtacacactaacagcttcgtctggattgaggacgttatcagaattgcggaagaacaggcctcatgcgaactgttcggtctgctgaaacggccggat(gaaaaa)tatgtcaca(gaaaag)gcctataacaatccgaaatttgtcgaagatatcgtccgtgatgtcgccgaaatacttaatcatgatgaccggatagatgcctatgttgttgaatca(gaaaac)tttgaatccatacataatcactctgcatacgcactgatagagcgc

lengths 20 - 29 : 
st)SLCPCSKKISDYGAHNQRSH(s|VTSLCPCSKKISDYGAHNQRSHV|sts
st)SLCPCSKKISDYGAHNQRSH(st|VTSLCPCSKKISDYGAHNQRSHVT|stst
t)SLCPCSKKISDYGAHNQRSH(s|TSLCPCSKKISDYGAHNQRSHV|ts
t)SLCPCSKKISDYGAHNQRSH(st|TSLCPCSKKISDYGAHNQRSHVT|tst

lengths 30 - 39 : 
st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(d|VTISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDE|std
t)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(d|TISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDE|td
t)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt|TISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDEK|tdt
dt)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(d|EKAYNNPKFVEDIVRDVAEILNHDDRIDAYVVESE|dtd
dt)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(dp|EKAYNNPKFVEDIVRDVAEILNHDDRIDAYVVESEF|dtdp
t)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(d|KAYNNPKFVEDIVRDVAEILNHDDRIDAYVVESE|td
t)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(dp|KAYNNPKFVEDIVRDVAEILNHDDRIDAYVVESEF|tdp

lengths 40 - 49 : 
st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt)YVT(d|VTISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDEKYVTE|stdtd
st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt)YVT(dt|VTISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDEKYVTEK|stdtdt
t)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt)YVT(d|TISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDEKYVTE|tdtd
t)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt)YVT(dt|TISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDEKYVTEK|tdtdt
dt)YVT(dt)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(d|EKYVTEKAYNNPKFVEDIVRDVAEILNHDDRIDAYVVESE|dtdtd
dt)YVT(dt)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(dp|EKYVTEKAYNNPKFVEDIVRDVAEILNHDDRIDAYVVESEF|dtdtdp
t)YVT(dt)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(d|KYVTEKAYNNPKFVEDIVRDVAEILNHDDRIDAYVVESE|tdtd
t)YVT(dt)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(dp|KYVTEKAYNNPKFVEDIVRDVAEILNHDDRIDAYVVESEF|tdtdp

lengths 50 - 59 : 
t)RLESDSGHIEMTFPYFVNKSAPISGVKSLLDYEVTFIGEIKHGDQYGFTMKVIVP(s|SRLESDSGHIEMTFPYFVNKSAPISGVKSLLDYEVTFIGEIKHGDQYGFTMKVIVPV|ts

lengths 60 - 69 : 
dp)VLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSM(s|HPVLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSMV|dps
dp)VLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSM(st|HPVLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSMVS|dpst
p)VLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSM(s|PVLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSMV|ps
p)VLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSM(st|PVLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSMVS|pst
st)RLESDSGHIEMTFPYFVNKSAPISGVKSLLDYEVTFIGEIKHGDQYGFTMKVIVP(st|VSRLESDSGHIEMTFPYFVNKSAPISGVKSLLDYEVTFIGEIKHGDQYGFTMKVIVPVT|stst
st)SLCPCSKKISDYGAHNQRSH(st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(d|VTSLCPCSKKISDYGAHNQRSHVTISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDE|ststd
st)SLCPCSKKISDYGAHNQRSH(st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt|VTSLCPCSKKISDYGAHNQRSHVTISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDEK|ststdt
t)SLCPCSKKISDYGAHNQRSH(st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(d|TSLCPCSKKISDYGAHNQRSHVTISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDE|tstd
t)SLCPCSKKISDYGAHNQRSH(st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt|TSLCPCSKKISDYGAHNQRSHVTISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDEK|tstdt
t)SLCPCSKKISDYGAHNQRSH(st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt)YVT(d|TSLCPCSKKISDYGAHNQRSHVTISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDEKYVTE|tstdtd

Edited by: David Halitsky on Mar 18, 2008 4:21 AM

Edited by: David Halitsky on Mar 18, 2008 4:22 AM

former_member10945
Contributor
0 Kudos

As I've read through this I am wondering what you are trying to evaluate. I am sure I can satisfy your requirements with just about any language on the planet --- it will just be as ugly as the C code in your example. Are you looking for the scripting language that makes it the most readable? The fastest (does it have to be interpreted or can it be compiled)? The easiest to extend, etc.?

Language choice is all about using the right tool for the job --- I am missing what else you need it to do besides just work.

If this is simply a mental exercise in how many languages the algorithm can be built in ... then it's not very difficult to answer --- any of them will do fine.

-d

former_member583013
Active Contributor
0 Kudos

Dan:

Sure, we can use any available language...And God knows there are plenty of them...I guess that David is looking for Speed and ease of development...

As you said...C code is somewhat ugly...And Ruby would be cleaner and easier...

Actually in my book "El Arte de Programar" I translated 5 algorithms into 14 Programming languages (Including ABAP, QBasic, Python, Perl, PHP and others)...Just to help people decide which language is easier to learn -;)

Greetings,

Blag.

Former Member
0 Kudos

Curiosity got the better of me as well - so before I pop out the PHP program based on the Perl one (not that different), what's the purpose behind all this again - is it like Blag says, just to find out what is fastest?

former_member181923
Active Participant
0 Kudos

Well, if LISP has gotten in here, then we might as well consider MONK.

MONK is/was an MIT knock-off of LISP that SeeBeyond (now Sun) used to use to build ETD's, before they were forced to go to JAVA.

So I guess you could say that MONK was SeeBeyond's ABAP.

heh heh heh

former_member10945
Contributor
0 Kudos

Although I am woefully inept at this class of languages, most tail-recursive functional languages would be best at solving this problem. This is really just a parsing problem like any other in the CS world. The most efficient languages for expressing parsers are almost always functional languages. You can make Python look like a functional language, but that isn't where it's speediest. I might try to bust out my Scheme book this weekend and have a crack at this.

Functional Languages of Note:

Scheme, Lisp, Erlang.
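
To make the "just a parsing problem" point concrete, here is a minimal sketch in C (the language already used in this thread) of the spec as one left-to-right scan. The names `set_group` and `mark_pairs` are made up for illustration and do not come from anyone's posted program. The sketch assumes the sequence contains only letters from the alphabet and that the caller supplies output buffers of at least 2*n+1 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: tag every letter in `letters` with group symbol `tag`. */
void set_group(char group[256], const char *letters, char tag)
{
    for (; *letters != '\0'; letters++)
        group[(unsigned char)*letters] = tag;
}

/* Hypothetical helper: scan `seq` once, left to right.
 * When two adjacent letters belong to different groups, emit them as
 * "(xy)" in out_letters and as "(gh)" (their group symbols) in
 * out_groups, then skip past both, so overlapping pairs are ignored.
 * Both outputs always have the same length. */
void mark_pairs(const char *seq, const char group[256],
                char *out_letters, char *out_groups, size_t cap)
{
    size_t n = strlen(seq), i = 0, m = 0;

    /* Worst case: n letters plus "()" around n/2 pairs, plus the NUL. */
    assert(cap >= 2 * n + 1);

    while (i < n) {
        unsigned char a = (unsigned char)seq[i];
        unsigned char b = (unsigned char)seq[i + 1];   /* '\0' at the end */

        if (b != '\0' && group[a] != group[b]) {
            out_letters[m] = '(';          out_groups[m] = '(';          m++;
            out_letters[m] = (char)a;      out_groups[m] = group[a];     m++;
            out_letters[m] = (char)b;      out_groups[m] = group[b];     m++;
            out_letters[m] = ')';          out_groups[m] = ')';          m++;
            i += 2;                        /* consume both letters */
        } else {
            out_letters[m] = (char)a;      out_groups[m] = (char)a;      m++;
            i += 1;
        }
    }
    out_letters[m] = '\0';
    out_groups[m] = '\0';
}
```

With the example groups from the original spec (s: A-E, p: F-J, d: K-O, e: P-T), calling `mark_pairs` on the sequence `EFK` should give `(EF)K` and `(sp)K`, which is the overlap rule from step 5. Keeping the two outputs the same length is a deliberate choice: it lets later code index both strings with the same position.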

-d

former_member181923
Active Participant
0 Kudos

Hi Dan -

Thanks very much for weighing in.

If you're going to try it in scheme, then as I said to Alvaro, please wait till tomorrow - I want to post a fuller version of Gunter's C code with three inputs and more outputs.

Also, I will probably have a really neat perl version posted as well - if Bill Mann has the time to write it tonight (like 3.5 minutes for him.)

Thanks again

djh

Former Member
0 Kudos

I was thinking Lisp myself. I once did a brute force in Lisp that brought down the whole network - someone of importance locked a file and no one could get in because they forgot to change the "random generated password" - some tweaking and Lisp pulled it off though.

former_member583013
Active Contributor
0 Kudos

Lisp? Sounds cool...Maybe I can add Euphoria to the bag of goodies...Flex anyone?

Greetings,

Blag.

former_member583013
Active Contributor
0 Kudos

David:

I already know C and C++....So it's not that hard for me to understand the code...Craig already offered to make a PHP version....So if you want I can try to make a Ruby version -;) Ok...I want to do it anyway...So just tell me if you want me to send the code to you -:D

Greetings,

Blag.

former_member181923
Active Participant
0 Kudos

Alvaro -

Yes - I would love to see a Ruby version.

But please wait till tomorrow - I want to send you a fuller version of the C program that takes three files as input and prints some additional kinds of output.

Then, if you do the PHP, we will have C, perl, Ruby, and PHP versions of the program, and we can probably have an interesting discussion of what's "better" to use when calling the program from SAP via RFC.

Thanks very much again.

Best

djh

former_member583013
Active Contributor
0 Kudos

David:

Sure, I will wait for your files -:)

Sadly my Python skills are very primitive...Hope someone else can help us with that...

Greetings,

Blag.

Former Member
0 Kudos

The question should not be whether they can do it (they can); it's a matter of how efficient they would be or which would be faster.

Trying to wrap my head around how I would do it in Perl (it's been a while). If you've got the code I'll do a PHP version if you like, for comparison's sake - PHP, like Perl, can run browser based or command line based. I have a feeling Python might run quicker but that's just a hunch.

former_member181923
Active Participant
0 Kudos

Hi Craig -

Thanks for taking the time to reply.

Here's a C program that does a little more than the spec. It was written by a fellow named Gunter Sterten over on your side of the pond ... in Germany.

Following the program is a sample of input and output. (I've truncated the output because it's quite large.)

#include <stdio.h>
#include <stdlib.h>   /* exit() */
int A[99999],G[333],B[99999],C[99999];
int x1,x2,n,m,a,b,c,i,j,k,x,y,z;
 
FILE *file;
 
 
 
int main(int argc,char*argv[]){
  if(argc<2){printf("\nusage:20let file\n\n");
             printf("marks pairs from different groups in file\n");
       exit(1);}
 

x='s';G['I']=x;G['M']=x;G['V']=x;G['A']=x;G['G']=x;
x='p';G['F']=x;G['L']=x;G['P']=x;G['W']=x;G['W']=x;
x='d';G['H']=x;G['Q']=x;G['D']=x;G['E']=x;G['E']=x;
x='t';G['S']=x;G['T']=x;G['Y']=x;G['N']=x;G['C']=x;G['K']=x;G['R']=x;
 
/*
for(i=65;i<70;i++)G[i]='s';
for(i=70;i<75;i++)G[i]='p';
for(i=75;i<80;i++)G[i]='d';
for(i=80;i<85;i++)G[i]='t';
*/
 

  if((file=fopen(argv[1],"rb"))==NULL){printf("\ncan't open file %s\n",argv[1]);exit(1);}
 
 n=0;
 /* read until EOF, keeping only letters that belong to a group */
 while((c=fgetc(file))!=EOF){
    if(G[c]!=0){n++;A[n]=c;}
 }
 
//for(i=1;i<=n;i++)printf("%c",A[i]);printf("\n");
//for(i=1;i<=n;i++)printf("%c",G[A[i]]);printf("\n");
 
m=0;for(i=1;i<=n;i++){
if(G[A[i]]==G[A[i+1]] || i==n){printf("%c",A[i]);m++;B[m]=A[i];goto m3;}
printf("(%c%c)",A[i],A[i+1]);i++;
m++;B[m]='(';m++;B[m]=A[i-1];m++;B[m]=A[i];m++;B[m]=')';
//printf("(%c)%c",G[A[i]],G[A[i+1]]);i++;
 
m3:;}printf("\n");
 

m=0;for(i=1;i<=n;i++){
//printf("i=%i A[i]=%i\n",i,A[i]);
if(G[A[i]]==G[A[i+1]] || i==n){printf("%c",A[i]);m++;C[m]=A[i];goto m4;}
printf("(%c%c)",G[A[i]],G[A[i+1]]);i++;
m++;C[m]='(';m++;C[m]=G[A[i-1]];m++;C[m]=G[A[i]];m++;C[m]=')';
//printf("(%c)%c",G[A[i]],G[A[i+1]]);i++;
 
m4:;}printf("\n");
 
 
 
//for(i=1;i<=m;i++)printf("%c",B[i]);printf("\n");
 
 
 
printf("\n");x1=20;x2=29;
for(i=1;i<m;i++)for(j=i+x1;j<i+x2;j++) {
  if(C[i]>95 && C[j]>95){
      for(x=i;x<=j;x++)printf("%c",C[x]);
      printf("|");
      for(x=i;x<=j;x++)if(B[x]>44)printf("%c",B[x]);
      printf("\n");}
}
 
printf("\n");x1=30;x2=39;
for(i=1;i<m;i++)for(j=i+x1;j<i+x2;j++) {
  if(C[i]>95 && C[j]>95){
      for(x=i;x<=j;x++)printf("%c",C[x]);
      printf("|");
      for(x=i;x<=j;x++)if(B[x]>44)printf("%c",B[x]);
      printf("\n");}
}
 
printf("\n");x1=40;x2=49;
for(i=1;i<m;i++)for(j=i+x1;j<i+x2;j++) {
  if(C[i]>95 && C[j]>95){
      for(x=i;x<=j;x++)printf("%c",C[x]);
      printf("|");
      for(x=i;x<=j;x++)if(B[x]>44)printf("%c",B[x]);
      printf("\n");}
}
 
printf("\n");x1=50;x2=59;
for(i=1;i<m;i++)for(j=i+x1;j<i+x2;j++) {
  if(C[i]>95 && C[j]>95){
      for(x=i;x<=j;x++)printf("%c",C[x]);
      printf("|");
      for(x=i;x<=j;x++)if(B[x]>44)printf("%c",B[x]);
      printf("\n");}
}
 
printf("\n");x1=60;x2=69;
for(i=1;i<m;i++)for(j=i+x1;j<i+x2;j++) {
  if(C[i]>95 && C[j]>95){
      for(x=i;x<=j;x++)printf("%c",C[x]);
      printf("|");
      for(x=i;x<=j;x++)if(B[x]>44)printf("%c",B[x]);
      printf("\n");}
}
 

}

Input:

MNKQIDLPIADVQGSLDTRHIAIDRVGIKAIRHPVVVADKGGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSMVSRLESDSGHIEMAFPYFINKSAPVSGVKSLLDYEVTFIGEIKHGNQYSFTMKVIVPVTSLCPCSKKISDYGAHNQRSHVTISVRTNSFIWIEDIIRIAEEQASCELYGLLKRPDEKYVTERAYNNPKFVEDIVRDVAEVLNHDDRIDAYIVESENFESIHNHSAYALIERDKRIR

Outputs (in same output file):

(MN)(KQ)(ID)L(PI)(AD)(VQ)(GS)(LD)T(RH)IA(ID)(RV)G(IK)A(IR)(HP)VVV(AD)(KG)G(GS)Q(HT)V(AQ)(FN)(MY)(VN)L(PH)(NF)(KG)(TH)(MS)(RF)(VE)(IL)N(SH)(ER)(EI)(SV)(ES)(FE)(EI)(LR)(SM)(VS)(RL)(ES)(DS)(GH)(IE)M(AF)(PY)(FI)NK(SA)(PV)(SG)(VK)(SL)(LD)(YE)(VT)(FI)(GE)(IK)(HG)(NQ)Y(SF)(TM)(KV)I(VP)(VT)(SL)(CP)CSK(KI)(SD)(YG)(AH)(NQ)R(SH)(VT)(IS)(VR)TN(SF)(IW)(IE)(DI)(IR)I(AE)E(QA)S(CE)(LY)(GL)(LK)(RP)D(EK)(YV)(TE)(RA)YN(NP)(KF)(VE)(DI)(VR)(DV)(AE)(VL)(NH)D(DR)(ID)(AY)I(VE)(SE)(NF)(ES)(IH)(NH)(SA)(YA)(LI)(ER)(DK)(RI)R

(st)(td)(sd)L(ps)(sd)(sd)(st)(pd)T(td)IA(sd)(ts)G(st)A(st)(dp)VVV(sd)(ts)G(st)Q(dt)V(sd)(pt)(st)(st)L(pd)(tp)(ts)(td)(st)(tp)(sd)(sp)N(td)(dt)(ds)(ts)(dt)(pd)(ds)(pt)(ts)(st)(tp)(dt)(dt)(sd)(sd)M(sp)(pt)(ps)NK(ts)(ps)(ts)(st)(tp)(pd)(td)(st)(ps)(sd)(st)(ds)(td)Y(tp)(ts)(ts)I(sp)(st)(tp)(tp)CSK(ts)(td)(ts)(sd)(td)R(td)(st)(st)(st)TN(tp)(sp)(sd)(ds)(st)I(sd)E(ds)S(td)(pt)(sp)(pt)(tp)D(dt)(ts)(td)(ts)YN(tp)(tp)(sd)(ds)(st)(ds)(sd)(sp)(td)D(dt)(sd)(st)I(sd)(td)(tp)(dt)(sd)(td)(ts)(ts)(ps)(dt)(dt)(ts)R

st)(td)(sd)L(ps)(sd)(s|MNKQIDLPIADV

st)(td)(sd)L(ps)(sd)(sd|MNKQIDLPIADVQ

st)(td)(sd)L(ps)(sd)(sd)(s|MNKQIDLPIADVQG

st)(td)(sd)L(ps)(sd)(sd)(st|MNKQIDLPIADVQGS

t)(td)(sd)L(ps)(sd)(s|NKQIDLPIADV

t)(td)(sd)L(ps)(sd)(sd|NKQIDLPIADVQ

t)(td)(sd)L(ps)(sd)(sd)(s|NKQIDLPIADVQG

t)(td)(sd)L(ps)(sd)(sd)(st|NKQIDLPIADVQGS

t)(td)(sd)L(ps)(sd)(sd)(st)(p|NKQIDLPIADVQGSL

td)(sd)L(ps)(sd)(sd)(s|KQIDLPIADVQG

td)(sd)L(ps)(sd)(sd)(st|KQIDLPIADVQGS

td)(sd)L(ps)(sd)(sd)(st)(p|KQIDLPIADVQGSL

td)(sd)L(ps)(sd)(sd)(st)(pd|KQIDLPIADVQGSLD

d)(sd)L(ps)(sd)(sd)(s|QIDLPIADVQG

d)(sd)L(ps)(sd)(sd)(st|QIDLPIADVQGS

d)(sd)L(ps)(sd)(sd)(st)(p|QIDLPIADVQGSL

d)(sd)L(ps)(sd)(sd)(st)(pd|QIDLPIADVQGSLD

sd)L(ps)(sd)(sd)(st)(p|IDLPIADVQGSL

sd)L(ps)(sd)(sd)(st)(pd|IDLPIADVQGSLD

sd)L(ps)(sd)(sd)(st)(pd)T(t|IDLPIADVQGSLDTR

sd)L(ps)(sd)(sd)(st)(pd)T(td|IDLPIADVQGSLDTRH
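
The five window loops in the program above differ only in the values of x1 and x2, so they could be folded into one helper. The following is a rough sketch of that idea, not Gunter's code: the name `collect_windows` is invented here, it uses 0-based strings instead of the 1-based int arrays, it tests for group letters with `islower` rather than `>95`, and it writes into a caller-supplied buffer instead of printing. As in the original loops, the upper offset x2 is exclusive.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: `letters` and `groups` are the two marked strings
 * (same length, index-aligned). For every position i holding a lowercase
 * group symbol, and every j with i+x1 <= j < i+x2 that also holds one,
 * append "groups[i..j]|letters[i..j] without parentheses\n" to `out`.
 * Returns the number of windows written; stops early if `out` is full. */
int collect_windows(const char *letters, const char *groups,
                    size_t x1, size_t x2, char *out, size_t cap)
{
    size_t m = strlen(groups), used = 0;
    int count = 0;

    assert(strlen(letters) == m && cap > 0);
    out[0] = '\0';

    for (size_t i = 0; i < m; i++) {
        if (!islower((unsigned char)groups[i]))
            continue;                       /* must start inside a marked pair */
        for (size_t j = i + x1; j < i + x2 && j < m; j++) {
            if (!islower((unsigned char)groups[j]))
                continue;                   /* must end inside a marked pair */

            size_t w = j - i + 1;
            if (used + 2 * w + 3 > cap)     /* both views, '|', '\n', NUL */
                return count;

            memcpy(out + used, groups + i, w);
            used += w;
            out[used++] = '|';
            for (size_t x = i; x <= j; x++)
                if (letters[x] != '(' && letters[x] != ')')
                    out[used++] = letters[x];
            out[used++] = '\n';
            out[used] = '\0';
            count++;
        }
    }
    return count;
}
```

Calling it once per range, for example with (20, 29), then (30, 39), and so on, and printing the buffer with `fputs` would stand in for the five copied loops. Returning the count also makes it easy to report how many windows each range produced.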

Edited by: Craig Cmehil on Mar 15, 2008 10:41 AM (applied "code" format for better reading)

Former Member
0 Kudos

No Perl code? You're making me work here. Perl is so much easier to convert to PHP; with C I've now got to think, and just for curiosity I can think of a few more interesting things to think about over the weekend.

former_member181923
Active Participant
0 Kudos

Craig -

I think I can get the perl for this very quickly ... so please hold off and keep an eye out for a post with the perl in a couple of days.

Please see also my note to Alvaro ... I want to send a fuller version of the program that will be better for teaching you guys about the basics of bioinformatics.

Thanks very much again.

djh