perl versus other "scripting" languages when doing string operations

Former Member · ‎03-14-2008

I've been told that perl is a "scripting" language like the other languages mentioned in this forum.

If that's true, can these other languages handle the following spec as well as perl can? (See spec at end of this post.)

Or is perl stronger in string operations than the other scripting languages mentioned here?

Here's the spec:

1. I give your program a twenty-letter alphabet (any twenty letter alphabet)

For example:

ABCDEFGHIJKLMNOPQRST

2. I also give your program four groups (any four groups) of letters in this alphabet:

For example:

s: A,B,C,D,E

p: F,G,H,I,J

d: K,L,M,N,O

e: P,Q,R,S,T

3. I also give your program a sequence over the twenty-letter alphabet that I gave you in Step (1) above:

For example:

ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA

4. Given this sequence, you search for pairs of adjacent letters (x,y) where x and y are from different groups (the groups defined in Step (2) above).

Also, you return the results of this search by giving me back the following two strings:

ABCD(EF)GHI(JK)LMN(OP)QRSTSRQ(PO)NML(KJ)IHG(FE)DCBA

ABCD(sp)GHI(pd)LMN(de)QRSTSRQ(ed)NML(dp)IHG(ps)DCBA

5. Note: if I give you a sequence that contains "overlapping" ordered pairs like:

...EFK...

then you ignore the second ordered pair. That is, you return:

...(EF)K
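
As a baseline for comparing the solutions posted in this thread, here is a minimal Python sketch of the spec above. It is only one possible reading: it assumes single-character letters, assumes every letter of the sequence belongs to exactly one group, and reads rule 5 as "scan left to right and skip past a pair once it has been marked". The names `mark_pairs` and `group_of` are made up for this illustration.

```python
def mark_pairs(sequence, groups):
    """Return (letters_marked, groups_marked) for one sequence.

    groups maps a group name (e.g. 's') to the letters in that group.
    """
    # Invert the mapping: letter -> group name.
    group_of = {letter: name for name, letters in groups.items() for letter in letters}

    letters_out, groups_out = [], []
    i = 0
    while i < len(sequence):
        x = sequence[i]
        y = sequence[i + 1] if i + 1 < len(sequence) else None
        if y is not None and group_of[x] != group_of[y]:
            # Adjacent letters from different groups: mark the pair and
            # jump past both letters, so an overlapping pair is ignored.
            letters_out.append("(" + x + y + ")")
            groups_out.append("(" + group_of[x] + group_of[y] + ")")
            i += 2
        else:
            letters_out.append(x)
            groups_out.append(x)
            i += 1
    return "".join(letters_out), "".join(groups_out)


groups = {"s": "ABCDE", "p": "FGHIJ", "d": "KLMNO", "e": "PQRST"}
first, second = mark_pairs("ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA", groups)
print(first)
print(second)
```

A letter that is not in any group raises a `KeyError` in this sketch; a real program would have to decide how to treat such input.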

former_member181923 · ‎10-01-2008

closing question to get below 10

former_member181923 · ‎03-23-2008

At Craig's suggestion, I've set up a WIKI page for this problem here:

https://wiki.sdn.sap.com/wiki/display/EmTech/Bio-InformaticCodingProblem+1

Contributors to this thread should feel free to post their solutions as child-pages of the above page.

Contributors of multiple solutions in different languages should post each solution on a different child-page.

I will post Bill's perl and Gunter's C as solutions to Problem 2, since their programs do more than what was asked for in the original spec given in the top-post of this thread.

Former Member · ‎03-22-2008

just wanted to add the JAVASCRIPT version, but this forum doesn't let me post it. "METHOD NOT IMPLEMENTED" it says. hmmm. no idea what that means.

anton

Former Member · ‎03-21-2008

Java version!


/**
 * 
 * @author Gregor Brett
 *
 */

public class djh
{
	public static void main(String[] args)
	{
		String	 a = "ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA";
		String[] s = {"A","B","C","D","E"};
		String[] p = {"F","G","H","I","J"};
		String[] d = {"K","L","M","N","O"};
		String[] e = {"P","Q","R","S","T"};
		String[] names = {"s","p","d","e"};
		String[][] groups = {s,p,d,e};
		String[] pairs = new String[(groups.length)*((groups[0].length*groups[0].length)*(groups.length-1))];
		String[] pairs_codes = new String[pairs.length];
		int count = 0;
		
		for(int i=0;i<groups.length;i++)
		{
			for(int n=0;n<groups[i].length;n++)
			{
				for(int m=0;m<groups.length;m++)
				{
					if(i != m)
					{
						for(int l=0;l<groups[m].length;l++)
						{
							pairs[count] = groups[i][n] + groups[m][l];
							pairs_codes[count] = names[i] + names[m];
							count++;
						}
					}
				}
			}
		}
		String ai = a;
		for(int i=0;i<pairs.length;i++)
		{
			a = a.replaceFirst(pairs[i], "("+pairs[i]+")");
			ai = ai.replaceFirst(pairs[i], "("+ pairs_codes[i] +")");
		}
		System.out.println(a);
		System.out.println(ai);		
	}
}

former_member181923 · ‎03-21-2008

I just sent Craig the files (see copy of email below.)

Note to Anton: heh heh heh ... I like your style!

Email to Craig:

Craig -

In case anyone wants to try the substring routine and the parenthesization of the nucleotide string, the attached zipfile has:

1) Gunter's c code: 20let.c;

2) Gunter's c exe: 20let.exe

3) Bill's latest perl: 20let-re.pl

4) three input files:

val1a.txt

val2a.txt

val3b.txt

5) output file: fileout.txt

On to the WIKI-page!!!!

Thanks very much again. You're being very kind.

Best

djh

Former Member · ‎03-21-2008

here's a solution using the exact same algorithm as the one i posted earlier. only the language used this time is a bit 'chattier' (or my programming skill in this language is).

looking forward to seeing if anyone guesses the language used.


*&---------------------------------------------------------------------*
*& Report  ZTW_REGEX1
*&
*&---------------------------------------------------------------------*
*& created for djh challenge; ACW210308
*&
*&---------------------------------------------------------------------*

REPORT  ZTW_REGEX1.

data: l_target    type string,
      l_targes    type string,
      lt_group    type table of string,
      ll_group    type string,
      ll_grouq    type string,
      lt_groupid  type table of string,
      ll_groupid  type string,
      ll_grouqid  type string,
      l_ind       type i,
      l_ine       type i,
      l_pattern   type string,
      l_replace   type string.

l_target = 'ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA'.
l_targes = 'ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA'.

append 'ABCDE' to lt_group. append 's' to lt_groupid.
append 'FGHIJ' to lt_group. append 'p' to lt_groupid.
append 'KLMNO' to lt_group. append 'd' to lt_groupid.
append 'PQRST' to lt_group. append 'e' to lt_groupid.

loop at lt_group into ll_group.
  l_ind = sy-tabix.
  loop at lt_group into ll_grouq.
    l_ine = sy-tabix.
    if l_ind <> l_ine.
      read table lt_group index l_ind into ll_group.
      read table lt_group index l_ine into ll_grouq.
      read table lt_groupid index l_ind into ll_groupid.
      read table lt_groupid index l_ine into ll_grouqid.
      concatenate '([' ll_group '][' ll_grouq '])' into l_pattern.
      replace regex l_pattern in l_target with '($1)'.
      concatenate '(' ll_groupid ll_grouqid ')' into l_replace.
      replace regex l_pattern in l_targes with l_replace.
    endif.
  endloop.
endloop.

write: / l_target.
write: / l_targes.

anton

former_member181923 · ‎03-20-2008

In Bill's perl program posted above, he put a comment in the "substrings" routine to indicate that he had a faster version in mind.

Here is his "regex" recoding of the "substrings" routine. He's pretty sure it will run faster than the original.


sub substrings {
    my ($x1, $x2) = @_;
    printf("\nlengths %d - %d : \n", $x1, $x2);
    pos($C) = 0;
    while ($C =~ /[spdt]/g) {
	pos($C)-1+$x1 < length $C or last;
	my $j = substr($C, pos($C)-1+$x1, $x2-$x1);
	while ($j =~ /[spdt]/g) {

	    my $a = my $c = substr($C, pos($C)-1, pos($j)+$x1);
	    my $b = substr($B, pos($C)-1, pos($j)+$x1);
	    $b =~ tr/()//d;
	    $c =~ tr/a-z//cd;
	    print "$a|$b|$c\n";
	}
    }
}
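
For readers who do not read Perl, the routine above can be paraphrased in Python roughly as follows. This is a sketch under assumptions, not Bill's code: it takes the two marked lines as arguments `B` and `C` instead of globals, returns the rows instead of printing them, and matches any lowercase letter rather than only `[spdt]`. The sample strings are made up and far shorter than the real data.

```python
import re

def substrings(B, C, x1, x2):
    """Rows 'a|b|c' for every pair of group letters in C lying x1..x2-1 positions apart."""
    rows = []
    for start in re.finditer(r"[a-z]", C):
        i = start.start()
        # Only look for the closing group letter inside the window [i+x1, i+x2).
        for end in re.finditer(r"[a-z]", C[i + x1 : i + x2]):
            j = i + x1 + end.start()
            a = C[i : j + 1]                                     # slice of the group line
            b = B[i : j + 1].replace("(", "").replace(")", "")  # letters without parentheses
            c = re.sub(r"[^a-z]", "", a)                         # group letters only
            rows.append(a + "|" + b + "|" + c)
    return rows

# Hypothetical marked lines, built the same way as the program's first two output lines.
B = "AB(EF)GH(JK)L"
C = "AB(sp)GH(pd)L"
for row in substrings(B, C, 5, 8):
    print(row)
```

The point of the regex version is the same as in the Perl: the outer loop jumps straight from one group letter to the next instead of testing every position.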

Former Member · ‎03-19-2008

a quick PHP solution:


<?php
// djh challenge

$target = $targes = 'ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA';
$group[1] = 'ABCDE'; $groupid[1] = "s";
$group[2] = 'FGHIJ'; $groupid[2] = "p";
$group[3] = 'KLMNO'; $groupid[3] = "d";
$group[4] = 'PQRST'; $groupid[4] = "e";


for($i=1; $i < count($group)+1; $i++) {
  for($j=1; $j < count($group)+1; $j++) {
    if ($i <> $j) {
		$target = preg_replace('/(['.$group[$i].']['.$group[$j].'])/', '($1)', $target);
		$targes = preg_replace('/(['.$group[$i].']['.$group[$j].'])/', '('.$groupid[$i].$groupid[$j].')', $targes);		
    }
  }
}
echo $target. "\n" . $targes;

//fulfills 5. & yields
//ABCD(EF)GHI(JK)LMN(OP)QRSTSRQ(PO)NML(KJ)IHG(FE)DCBA
//QED.
?>

regards, anton
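
One thing worth checking with any multi-pass approach like this is how rule 5 plays out when the passes run in group order rather than in string order. The Python sketch below is my own rendering of the nested loop (one substitution per ordered pair of distinct groups), not a translation checked against the PHP; `multi_pass` and `GROUPS` are names invented for the illustration.

```python
import re

GROUPS = {"s": "ABCDE", "p": "FGHIJ", "d": "KLMNO", "e": "PQRST"}

def multi_pass(seq):
    """One re.sub per ordered pair of distinct groups, in group order."""
    for name_i, letters_i in GROUPS.items():
        for name_j, letters_j in GROUPS.items():
            if name_i != name_j:
                pattern = "([" + letters_i + "][" + letters_j + "])"
                # Once a pair is wrapped, the parentheses stop later passes
                # from matching across it.
                seq = re.sub(pattern, r"(\1)", seq)
    return seq

print(multi_pass("ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA"))
print(multi_pass("EFK"))
print(multi_pass("KEF"))
```

On the sample sequence and on `EFK` this gives the results the spec asks for. On `KEF` the s/p pass runs before the d/s pass, so the sketch returns `K(EF)`, whereas a strict left-to-right reading of rule 5 would give `(KE)F`. Whether that matters depends on how rule 5 is meant.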

former_member181923 · ‎03-19-2008

Here are some comments from Bill on the situation as he sees it:


I found the ruby code and ran it on my Linux box.  It takes no arguments and prints out:
mannb:dh $ ruby 20let.ruby 
ABCD(EF)GHI(JK)LMN(OP)QRSTSRQ(PO)NML(KJ)IHG(FE)DCBA
Since I don't know ruby, and it doesn't do the same things, it's hard to evaluate.  Certainly I'd need to look at a ruby manual to understand it.  The regular expression is used to locate the stuff in (), but as Ethan said, it doesn't try to find the substrings, at least not yet.
I translated the C code to perl in 2-3 hours. I didn't try to figure out the best possible algorithm, and I tried to make the programs parallel so they would be easy to compare. The C program is not optimized for speed, space, or style. My perl program is fairly simple if you can read Perl regular expressions, and understand the options of tr/// I'm using.
Python uses what is basically the same regular expression subroutine package as perl.  Ruby seems different.
We used perl because I like it and it was flexible and speedy enough for most things.  If not, C's my choice.
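
Since the tr/// options are mentioned as the part that needs explaining, here is a small Python sketch of what the two tr/// calls in the perl program do. The two sample strings are hypothetical and only mimic the shape of the program's first two output lines.

```python
import re

line_b = "MV(VS)RL(HP)K"   # hypothetical: letters with marked pairs
line_c = "MV(st)RL(dp)K"   # hypothetical: same line with group codes in the pairs

# Perl: $b =~ tr/()//d;     d = delete every character listed, here '(' and ')'
no_parens = line_b.translate(str.maketrans("", "", "()"))

# Perl: $c =~ tr/a-z//cd;   c = complement the list, d = delete:
#                           drop every character that is NOT in a-z
only_lower = re.sub(r"[^a-z]", "", line_c)

print(no_parens)
print(only_lower)
```

Python has no direct counterpart to tr///, so the sketch uses `str.translate` for the plain deletion and a negated character class for the complemented one.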

former_member181923 · ‎03-19-2008

Hey Ethan -

Glad you "couldn't resist".

No matter what the technical merits of the regex approach are or are not, I gotta say you've got style and class. "Big_ol_regex" indeed!

I've asked Bill Mann to comment on your code, and if he has the time, I'll post what he has to say.

And I'm sure some of the more savvy scriptors around here will have something to say about it as well.

Thanks again for posting your approach.

Dave

esjewett · ‎03-19-2008

Well David, I couldn't pass this one up, and so my inaugural post in the forums is a snippet of Ruby code

This is set up to run your initial example, and doesn't return the second string you ask for, but I'm out of time. I think it'll probably work on the amino acid examples (multi-character 'letters') that you give, but I haven't tested it. It takes a sequence in the alphabet, an array of group names, and four groups of letters in a hash.

Most importantly, it uses a regular expression to do the dirty work. I felt some obligation to back up all my Twitter talk over the last couple of days and though I'm no expert, this is what I came up with.

Enjoy.


def david_halitsky_challenge(seq = "ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA", 
                              gls = %w(s p d e), 
                              groups = { 's' => %w(A B C D E),
                                             'p' => %w(F G H I J),
                                             'd' => %w(K L M N O),
                                             'e' => %w(P Q R S T) })

# Arrays of letters *not* in each group

  notgroups = { gls[0] => groups[gls[1]]+groups[gls[2]]+groups[gls[3]],
                gls[1] => groups[gls[0]]+groups[gls[2]]+groups[gls[3]],
                gls[2] => groups[gls[1]]+groups[gls[0]]+groups[gls[3]],
                gls[3] => groups[gls[1]]+groups[gls[2]]+groups[gls[0]] }

# One regex per group

  regexes = Hash.new()

# Generate regex string of form d((K|L|M|N|O)(F|G|H|I|J|A|B|C|D|E|P|Q|R|S|T))

  gls.each do |g|
    regexes[g] = '(('

    groups[g].each do |l|
      regexes[g] += l + '|'
    end

    regexes[g].chomp!('|')
    regexes[g] += ')('

    notgroups[g].each do |l|
      regexes[g] += l + '|'
    end

    regexes[g].chomp!('|')
    regexes[g] += '))'
  end

# Build the full regex.

  big_ol_regex = ''

  # each_value yields just the regex strings, so no key has to be chopped off
  regexes.each_value do |r|
    big_ol_regex += r + '|'
  end

  big_ol_regex.chomp!('|')
  big_ol_regex += '+?'

# Wrap each whole match (\0) in parentheses.
  puts seq.gsub(Regexp.new(big_ol_regex), '(\0)')
end

david_halitsky_challenge()

Edited by: Ethan Jewett on Mar 19, 2008 2:22 AM
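
The second string that the Ruby version leaves out can be produced from the same kind of single regex by substituting through a callback. The Python sketch below borrows the "letter in the group followed by a letter not in the group" idea from the post above; it assumes single-character letters and at least two groups, and `build_pattern` and `mark_both` are hypothetical names.

```python
import re

def build_pattern(groups):
    """One alternative per group: a letter in the group, then a letter outside it."""
    alphabet = "".join(groups.values())
    parts = []
    for letters in groups.values():
        others = "".join(ch for ch in alphabet if ch not in letters)
        parts.append("[" + re.escape(letters) + "][" + re.escape(others) + "]")
    return re.compile("|".join(parts))

def mark_both(seq, groups):
    group_of = {ch: name for name, letters in groups.items() for ch in letters}
    pattern = build_pattern(groups)
    # First string: wrap the matched letters themselves.
    letters_marked = pattern.sub(lambda m: "(" + m.group(0) + ")", seq)
    # Second string: replace each matched letter by the name of its group.
    groups_marked = pattern.sub(
        lambda m: "(" + group_of[m.group(0)[0]] + group_of[m.group(0)[1]] + ")", seq
    )
    return letters_marked, groups_marked

groups = {"s": "ABCDE", "p": "FGHIJ", "d": "KLMNO", "e": "PQRST"}
for line in mark_both("ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA", groups):
    print(line)
```

Because the substitution is a single left-to-right scan and matches cannot overlap, an input such as `EFK` comes out as `(EF)K` without any extra handling.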

former_member181923 · ‎03-18-2008

Dan/Craig -

Yeah - Alvaro pretty much has it correct: speed and convenience. But not convenience in the usual sense.

Let's look at speed first.

Imagine you had to do the algorithm on hundreds of millions of strings, many much longer than the strings in inputs 1 and 2. That's where the speed comes in.

But here's where the "convenience" comes in (in my sense of the word "convenience"). Suppose that in terms of speed, it looks like this (from fastest to slowest):

C/C++

perl

PHP

But suppose your best algorithm-creator actually thinks best "in PHP".

Then for that person, it's more "convenient" to frame a solution in PHP and then let others "translate" that solution into other languages that generate faster runtimes.

Of course, there are some who would say it's better to "think-up" the algorithm in language-neutral terms. But in my experience, that's not really the way people work.

Because the nature of each language interacts with the algorithm-creation process in very subtle ways.

former_member181923 · ‎03-18-2008

Here's a perl program that Bill Mann wrote to do the same thing as the C program in the last post.

He says it may not be the fastest possible perl (but that's just him being modest!).


#!/usr/bin/perl -w

# perl version of 20let.c5

sub usage {
    print("\nusage:20let protein-file nucleotide-file pairs-include-file\n\n");
    print("marks amino-acid-pairs from different groups in protein-file\n");
    print("iff they are in the include-file\n");
}

{
    @ARGV != 3 and &usage, exit 1;

#----------------define the groups 
    for (qw(I M V A G)) {
	$G{$_} = 's';
    }
    for (qw(F L P W)) {
	$G{$_} = 'p';
    }
    for (qw(H Q D E)) {
	$G{$_} = 'd';
    }
    for (qw(S T Y N C K R)) {
	$G{$_} = 't';
    }

#----------------the 4 bases
    $bases = '[acgt]';

#---------------- read include-file       (3rd argument)
    open(I, "<$ARGV[2]") or die "can't open include-file $ARGV[2]\n";
    while (defined($_ = <I>)) {
	/^(..) ($bases{6,6})$/io and $E{$2} = $1;
    }

#------------------read amino-acid file   (1st argument)
    open(I, "<$ARGV[0]") or die "can't open amino-acid file $ARGV[0]\n";
    while (defined($_ = <I>)) {
	tr/IMVAGFLPWHQDESTYNCKR//cd;	# delete anything else
	$P .= $_;
    }
    $p = length $P;

#------------------read nucleotide file   (2nd argument)
    open(I, "<$ARGV[1]") or die "can't open nucleotide file $ARGV[1]\n";
    while (defined($_ = <I>)) {
	tr/acgt//cd;			# delete anything else
	$N .= $_;
    }
    length $N >= $p * 3 or
	die "amino-acid file doesn't match nucleotide file\n";

#------------three output lines------------------
    for ($i=0; $i < $p; ++$i) {
	if ($E{substr($N, $i*3, 6)}) {
	    my $a = $G{substr($P, $i, 1)};
	    my $b = $G{substr($P, $i+1, 1)};
	    if (!defined $a || !defined $b) {
		1;
	    }
	    if ($a ne $b)
	{
	    $B .= '(' . substr($P, $i, 2) . ')';
	    $C .= '(' . $G{substr($P, $i, 1)} . $G{substr($P, $i+1, 1)} . ')';
	    $D .= '(' . substr($N, $i*3, 6) . ')';
	    ++$i;
	    next;
	}}

	$B .= substr($P, $i, 1);
	$C .= substr($P, $i, 1);
	$D .= substr($N, $i*3, 3);
    }
    print $B, "\n";
    print $C, "\n";
    print $D, "\n";

#--------------substrings------------

    substrings(20,29);
    substrings(30,39);
    substrings(40,49);
    substrings(50,59);
    substrings(60,69);
}

sub substrings {
    my ($x1, $x2) = @_;
    printf("\nlengths %d - %d : \n", $x1, $x2);
    for ($i = 0; $i < length $C; ++$i) { # using m// and pos() might be faster
	substr($C, $i, 1) =~ /[a-z]/o or next;
	for ($j = $i+$x1; $j < $i+$x2 && $j < length $C; ++$j) {
	    substr($C, $j, 1) =~ /[a-z]/o or next;

	    my $a = my $c = substr($C, $i, $j-$i+1);
	    my $b = substr($B, $i, $j-$i+1);
	    $b =~ tr/()//d;
	    $c =~ tr/a-z//cd;
	    print "$a|$b|$c\n";
	}
    }
}

former_member181923 · ‎03-18-2008

OK - here is the final stuff on the "C" side.

To execute the program, the command line is:


20let.exe file1.txt file2.txt file3.txt > fileout.txt

Below, I've provided:

a) source code 20let.c

b) sample input file1.txt

c) sample input file2.txt

d) sample input file3.txt

e) output fileout.txt generated from these input files.

As soon as Bill finishes the perl version of the source code, I'll post that also.


***************
source code of 20let.c
***************
// 20let.c5

#include <stdio.h>
#include <stdlib.h>

int T[333],A[99999],G[333],B[99999],C[99999],N[299999],P[99999];
int n1,n2,f,p,x1,x2,n,m,a,b,c,i,j,k,x,y,z;
int E[233][233];

FILE *file;
int substrings(int x1,int x2);


int main(int argc, char*argv[]) {
    if(argc<4){
        printf("\nusage:20let protein-file nucleotide-file pairs-include-file\n\n");
        printf("marks amino-acid-pairs from different groups in protein-file\n");
        printf("iff they are in the include-file\n");
        exit(1);
    }


//----------------define the groups        G['I'] = 's', e.g.

    x='s'; G['I']=x;G['M']=x;G['V']=x;G['A']=x;G['G']=x;
    x='p'; G['F']=x;G['L']=x;G['P']=x;G['W']=x;G['W']=x;
    x='d'; G['H']=x;G['Q']=x;G['D']=x;G['E']=x;G['E']=x;
    x='t'; G['S']=x;G['T']=x;G['Y']=x;G['N']=x;G['C']=x;G['K']=x;G['R']=x;

//----------------the 4 bases              T['a'] = 0 thru 3
    for(x=0;x<222;x++)
        T[x]=-999;
    T['a']=0;T['c']=1;T['g']=2;T['t']=3;
    T['A']=0;T['C']=1;T['G']=2;T['T']=3;

/*
  for(i=65;i<70;i++)G[i]='s';
  for(i=70;i<75;i++)G[i]='p';
  for(i=75;i<80;i++)G[i]='d';
  for(i=80;i<85;i++)G[i]='t';
*/




//---------------- read include-file   file3 xxxyyy pairs E[x][y] of interest

    f=0;
    for(x=0;x<222;x++)
        for(y=0;y<222;y++)
            E[x][y]=0;

    if((file=fopen(argv[3],"rb"))==NULL){
        printf("\ncan't open include-file %s\n",argv[3]);exit(1);
    }

mq1: if(feof(file))
        goto mq3;
    x=fgetc(file);y=fgetc(file);x=fgetc(file);
    x=T[fgetc(file)]*16+T[fgetc(file)]*4+T[fgetc(file)];
    y=T[fgetc(file)]*16+T[fgetc(file)]*4+T[fgetc(file)];
    if(x<64 && x>=0 && y<64 && y>=0){
        E[x][y]=1;
        f++;
    }
mq2: if(feof(file))
        goto mq3;
    a=fgetc(file);
    if(a!=10)
        goto mq2;
    goto mq1;

mq3: fclose(file);


//------------------read amino-acid file    file1 == P array
    if((file=fopen(argv[1],"rb"))==NULL){
        printf("\ncan't open file %s\n",argv[1]);exit(1);}
    p=0;
m1p: if(feof(file))
        goto m2p;
    p++;
    P[p]=fgetc(file);
    if(G[P[p]]==0)
        p--;
    goto m1p;

m2p:;
    fclose(file);


//------------------read nucleotide file    file2 == N array
    if((file=fopen(argv[2],"rb"))==NULL){
        printf("\ncan't open file %s\n",argv[2]);exit(1);
    }
    n=0;
m1n: if(feof(file))
        goto m2n;
    n++;
    N[n]=fgetc(file);
    if(N[n]!='a' && N[n]!='c' && N[n]!='g' && N[n]!='t')
        n--;
    goto m1n;
m2n:;
    fclose(file);


//for(i=1;i<=p;i++)printf("%c",P[i]);printf("\n");
//for(i=1;i<=n;i++)printf("%c",N[i]);printf("\n");
//printf("%i include-pairs  %i nucleotides  %i proteins\n",f,n,p);


//------------1st line------------------       B[i] = result
    m=0;
    for(i=1;i<=p;i++){
        n1=T[N[i*3-2]]*16+T[N[i*3-1]]*4+T[N[i*3]];
        n2=T[N[i*3+1]]*16+T[N[i*3+2]]*4+T[N[i*3+3]];

//printf("\ni=%i p=%i n1=%i n2=%i\n",i,p,n1,n2);

        if(E[n1][n2]<1 || G[P[i]]==G[P[i+1]] /* || i==n */){
            printf("%c",P[i]);
            m++;
            B[m]=P[i];
            goto m3;
        }
        printf("(%c%c)",P[i],P[i+1]);
        i++;
        m++;
        B[m]='(';
        m++;
        B[m]=P[i-1];
        m++;
        B[m]=P[i];
        m++;
        B[m]=')';
//printf("(%c)%c",G[A[i]],G[A[i+1]]);i++;
    m3:;
    }
    printf("\n");


//------------2nd line------------------       C[i] = result
    m=0;
    for(i=1;i<=p;i++){
        n1=T[N[i*3-2]]*16+T[N[i*3-1]]*4+T[N[i*3]];
        n2=T[N[i*3+1]]*16+T[N[i*3+2]]*4+T[N[i*3+3]];

        if(E[n1][n2]<1 || G[P[i]]==G[P[i+1]] /* || i==n */){
            printf("%c",P[i]);
            m++;
            C[m]=P[i];
            goto m4;
        }
        printf("(%c%c)",G[P[i]],G[P[i+1]]);
        i++;
        m++;
        C[m]='(';
        m++;
        C[m]=G[P[i-1]];
        m++;
        C[m]=G[P[i]];
        m++;
        C[m]=')';
//printf("(%c)%c",G[A[i]],G[A[i+1]]);i++;
    m4:;
    }
    printf("\n");



//for(i=1;i<=m;i++)printf("%c",B[i]);printf("\n");



//------------3rd line------------------         printf only
    m=0;
    for(i=1;i<=p;i++){
        n1=T[N[i*3-2]]*16+T[N[i*3-1]]*4+T[N[i*3]];
        n2=T[N[i*3+1]]*16+T[N[i*3+2]]*4+T[N[i*3+3]];
        if(E[n1][n2]<1 || G[P[i]]==G[P[i+1]] /* || i==n */){
            printf("%c%c%c",N[i*3-2],N[i*3-1],N[i*3]);
            goto m33;
        }
        printf("(%c%c%c%c%c%c)",N[i*3-2],N[i*3-1],N[i*3],N[i*3+1],N[i*3+2],N[i*3+3]);
        i++;
    m33:;
    }
    printf("\n");




//--------------substrings------------

    substrings(20,29);
    substrings(30,39);
    substrings(40,49);
    substrings(50,59);
    substrings(60,69);

    return 0;
}


int substrings(int x1,int x2)
{

    printf("\n");
    printf("lengths %i - %i : \n",x1,x2);
    for(i=1; i<p; i++)
        for (j=i+x1; j<i+x2; j++) {
            if (C[i]>95 && C[j]>95) {   // if lc letter in line2
                for(x=i;x<=j;x++)
                    printf("%c",C[x]);
                printf("|");
                for(x=i;x<=j;x++)
                    if(B[x]>44)         // if not () in line 1
                        printf("%c",B[x]);

                printf("|");
                for(x=i;x<=j;x++)
                    if(C[x]>95)         // if lc letter line2
                        printf("%c",C[x]);

                printf("\n");}
        }
}

******************
input file1.txt
******************
MKKHTDQPIADVQGSPDTRH
IAIDRVGIKAIRHPVLVADK
DGGSQHTVAQFNMYVNLPHN
FKGTHMSRFVEILNSHEREI
SVESFEEILRSMVSRLESDS
GHIEMTFPYFVNKSAPISGV
KSLLDYEVTFIGEIKHGDQY
GFTMKVIVPVTSLCPCSKKI
SDYGAHNQRSHVTISVHTNS
FVWIEDVIRIAEEQASCELF
GLLKRPDEKYVTEKAYNNPK
FVEDIVRDVAEILNHDDRID
AYVVESEBFESIHNHSAYAL
IERD

***********************
input file2.txt
**********************
atgaaaaaacatactgatcaacctatcgctgatgtgcagggctcaccggataccagacat
atcgcaattgacagagtcggaatcaaagcgattcgtcacccggttctggtcgccgataag
gatggtggttcccagcataccgtggcgcaatttaatatgtacgtcaatctgccacataat
ttcaaagggacgcatatgtcccgttttgtggagatactaaatagccacgaacgtgaaatt
tcggttgaatcatttgaagaaattttgcgctccatggtcagcaggctggaatcagattcc
ggccatattgaaatgacttttccctacttcgtcaataaatcagcccctatctcaggtgta
aaaagcttgctggattatgaggtaacctttatcggcgaaattaaacatggcgatcaatat
gggtttaccatgaaggtgatcgttcctgttaccagcctgtgcccctgctccaagaaaata
tccgattacggtgcgcataaccagcgttcacacgtcaccatttctgtacacactaacagc
ttcgtctggattgaggacgttatcagaattgcggaagaacaggcctcatgcgaactgttc
ggtctgctgaaacggccggatgaaaaatatgtcacagaaaaggcctataacaatccgaaa
tttgtcgaagatatcgtccgtgatgtcgccgaaatacttaatcatgatgaccggatagat
gcctatgttgttgaatcagaaaactttgaatccatacataatcactctgcatacgcactg
atagagcgcgac 

******************
input file3.txt
******************
FA tttgcc
FA ttcgcc
FA tttgct
FA ttcgct
LK ttaaaa
LK ttgaaa
LK ttaaag
LK ttgaag
LS ctgctc
LS ctgctt
LS ctactc
LS ctactt
LT ctcacc
LT ctcact
LT cttacc
LT cttact
LY ctctac
LY ctctat
LY ctttac
LY ctttat
LG ctcggc
LG ctcggt
LG cttggc
LG cttggt
IP attccc
IP attcct
IP atcccc
IP atccct
IP attcca
IP attccg
IP atccca
IP atcccg
ML atgctc
ML atgctt
ML atgctc
ML atgctt
VL gtgctg
VL gtgcta
VL gtactg
VL gtacta
VS gtgtcc
VS gtatct
VS gtgtcc
VS gtatct
VT gtcacc
VT gtcact
VT gttacc
VT gttact
VS gtcagc
VS gtcagt
VS gttagc
VS gttagt
SL tcgctg
SL tcgcta
SL tcactg
SL tcacta
SP tctcca
SP tctccg
SP tcccca
SP tccccg
PV ccggtg
PV ccggta
PV ccagtg
PV ccagta
PG cccggc
PG cccggt
PG cctggc
PG cctggt
TL acgctg
TL acgcta
TL acactg
TL acacta
TP acgccg
TP acgcca
TP acaccg
TP acacca
AL gcttta
AL gctttg
AL gcctta
AL gccttg
AP gcgccg
AP gcgcca
AP gcaccg
AP gcacca
AP gctcca
AP gctccg
AP gcccca
AP gccccg
AN gctaat
AN gctaac
AN gccaat
AN gccaac
AS gccagc
AS gccagt
AS gctagc
AS gctagt
YP tatccg
YP tatcca
YP tacccg
YP taccca
HP catccg
HP catcca
HP cacccg
HP caccca
QR cagcga
QR cagcgg
QR caacga
QR caacgg
DL gatttg
DL gattta
DL gacttg
DL gactta
EN gaaaat
EN gaaaac
EN gagaat
EN gagaac
EK gaaaaa
EK gaaaag
EK gagaaa
EK gagaag
ER gagcga
ER gagcgg
ER gaacga
ER gaacgg
WR tggcga
WR tggcgg
RV cgggtg
RV cgggta
RV cgagtg
RV cgagta
RW cggtgg
RW cgatgg
SG agtgga
SG agtggg
SG agcgga
SG agcggg
GF ggtttt
GF ggtttc
GF ggcttt
GF ggcttc
GL gggctg
GL gggcta
GL ggactg
GL ggacta
GY gggtat
GY gggtac
GY ggatat
GY ggatac
GY ggttat
GY ggttac
GY ggctat
GY ggctac
GK ggaaaa
GK ggaaag
GK gggaaa
GK gggaag
GK ggcaag
GK ggcaaa
GK ggtaag
GK ggtaaa
GW ggctgg
GW ggttgg
GR gggcgg
GR gggcga
GR ggacgg
GR ggacga
GS ggcagc
GS ggcagt
GS ggtagc
GS ggtagt

***********************
output fileout.txt
**********************
MKKHTDQPIADVQGSPDTRHIAIDRVGIKAIR(HP)VLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSM(VS)RLESDSGHIEMTFPYFVNKSAPISGVKSLLDYEVTFIGEIKHGDQYGFTMKVIVP(VT)SLCPCSKKISDYGAHNQRSH(VT)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(EK)YVT(EK)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(EF)ESIHNHSAYALIERD
MKKHTDQPIADVQGSPDTRHIAIDRVGIKAIR(dp)VLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSM(st)RLESDSGHIEMTFPYFVNKSAPISGVKSLLDYEVTFIGEIKHGDQYGFTMKVIVP(st)SLCPCSKKISDYGAHNQRSH(st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt)YVT(dt)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(dp)ESIHNHSAYALIERD

atgaaaaaacatactgatcaacctatcgctgatgtgcagggctcaccggataccagacatatcgcaattgacagagtcggaatcaaagcgattcgt(cacccg)gttctggtcgccgataaggatggtggttcccagcataccgtggcgcaatttaatatgtacgtcaatctgccacataatttcaaagggacgcatatgtcccgttttgtggagatactaaatagccacgaacgtgaaatttcggttgaatcatttgaagaaattttgcgctccatg(gtcagc)aggctggaatcagattccggccatattgaaatgacttttccctacttcgtcaataaatcagcccctatctcaggtgtaaaaagcttgctggattatgaggtaacctttatcggcgaaattaaacatggcgatcaatatgggtttaccatgaaggtgatcgttcct(gttacc)agcctgtgcccctgctccaagaaaatatccgattacggtgcgcataaccagcgttcacac(gtcacc)atttctgtacacactaacagcttcgtctggattgaggacgttatcagaattgcggaagaacaggcctcatgcgaactgttcggtctgctgaaacggccggat(gaaaaa)tatgtcaca(gaaaag)gcctataacaatccgaaatttgtcgaagatatcgtccgtgatgtcgccgaaatacttaatcatgatgaccggatagatgcctatgttgttgaatca(gaaaac)tttgaatccatacataatcactctgcatacgcactgatagagcgc

lengths 20 - 29 : 
st)SLCPCSKKISDYGAHNQRSH(s|VTSLCPCSKKISDYGAHNQRSHV|sts
st)SLCPCSKKISDYGAHNQRSH(st|VTSLCPCSKKISDYGAHNQRSHVT|stst
t)SLCPCSKKISDYGAHNQRSH(s|TSLCPCSKKISDYGAHNQRSHV|ts
t)SLCPCSKKISDYGAHNQRSH(st|TSLCPCSKKISDYGAHNQRSHVT|tst

lengths 30 - 39 : 
st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(d|VTISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDE|std
t)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(d|TISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDE|td
t)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt|TISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDEK|tdt
dt)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(d|EKAYNNPKFVEDIVRDVAEILNHDDRIDAYVVESE|dtd
dt)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(dp|EKAYNNPKFVEDIVRDVAEILNHDDRIDAYVVESEF|dtdp
t)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(d|KAYNNPKFVEDIVRDVAEILNHDDRIDAYVVESE|td
t)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(dp|KAYNNPKFVEDIVRDVAEILNHDDRIDAYVVESEF|tdp

lengths 40 - 49 : 
st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt)YVT(d|VTISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDEKYVTE|stdtd
st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt)YVT(dt|VTISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDEKYVTEK|stdtdt
t)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt)YVT(d|TISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDEKYVTE|tdtd
t)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt)YVT(dt|TISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDEKYVTEK|tdtdt
dt)YVT(dt)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(d|EKYVTEKAYNNPKFVEDIVRDVAEILNHDDRIDAYVVESE|dtdtd
dt)YVT(dt)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(dp|EKYVTEKAYNNPKFVEDIVRDVAEILNHDDRIDAYVVESEF|dtdtdp
t)YVT(dt)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(d|KYVTEKAYNNPKFVEDIVRDVAEILNHDDRIDAYVVESE|tdtd
t)YVT(dt)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(dp|KYVTEKAYNNPKFVEDIVRDVAEILNHDDRIDAYVVESEF|tdtdp

lengths 50 - 59 : 
t)RLESDSGHIEMTFPYFVNKSAPISGVKSLLDYEVTFIGEIKHGDQYGFTMKVIVP(s|SRLESDSGHIEMTFPYFVNKSAPISGVKSLLDYEVTFIGEIKHGDQYGFTMKVIVPV|ts

lengths 60 - 69 : 
dp)VLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSM(s|HPVLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSMV|dps
dp)VLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSM(st|HPVLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSMVS|dpst
p)VLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSM(s|PVLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSMV|ps
p)VLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSM(st|PVLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSMVS|pst
st)RLESDSGHIEMTFPYFVNKSAPISGVKSLLDYEVTFIGEIKHGDQYGFTMKVIVP(st|VSRLESDSGHIEMTFPYFVNKSAPISGVKSLLDYEVTFIGEIKHGDQYGFTMKVIVPVT|stst
st)SLCPCSKKISDYGAHNQRSH(st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(d|VTSLCPCSKKISDYGAHNQRSHVTISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDE|ststd
st)SLCPCSKKISDYGAHNQRSH(st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt|VTSLCPCSKKISDYGAHNQRSHVTISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDEK|ststdt
t)SLCPCSKKISDYGAHNQRSH(st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(d|TSLCPCSKKISDYGAHNQRSHVTISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDE|tstd
t)SLCPCSKKISDYGAHNQRSH(st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt|TSLCPCSKKISDYGAHNQRSHVTISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDEK|tstdt
t)SLCPCSKKISDYGAHNQRSH(st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt)YVT(d|TSLCPCSKKISDYGAHNQRSHVTISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDEKYVTE|tstdtd

Edited by: David Halitsky on Mar 18, 2008 4:21 AM

Edited by: David Halitsky on Mar 18, 2008 4:22 AM
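
The core of what 20let.c computes for its first three output lines can be sketched in a few lines of Python. This is a simplified illustration under assumptions, not a port: it keeps the include-file as a plain set of six-letter strings instead of the base-4 table `E[x][y]`, it takes its inputs as arguments instead of reading files, and `mark` and `GROUP_OF` are names invented here. The tiny input is a five-residue slice of the sample data above.

```python
# Amino-acid groups as defined in 20let.c.
GROUPS = {"s": "IMVAG", "p": "FLPW", "d": "HQDE", "t": "STYNCKR"}
GROUP_OF = {aa: name for name, letters in GROUPS.items() for aa in letters}

def mark(protein, nucleotides, include):
    """Return the three marked lines (letters, group codes, nucleotides)."""
    b, c, d = [], [], []
    i = 0
    while i < len(protein):
        six = nucleotides[3 * i : 3 * i + 6]   # codons of residues i and i+1
        if (i + 1 < len(protein)
                and six in include
                and GROUP_OF[protein[i]] != GROUP_OF[protein[i + 1]]):
            b.append("(" + protein[i : i + 2] + ")")
            c.append("(" + GROUP_OF[protein[i]] + GROUP_OF[protein[i + 1]] + ")")
            d.append("(" + six + ")")
            i += 2                              # skip the second residue of the pair
        else:
            b.append(protein[i])
            c.append(protein[i])
            d.append(nucleotides[3 * i : 3 * i + 3])
            i += 1
    return "".join(b), "".join(c), "".join(d)

# "IRHPV" / "attcgtcacccggtt" is a slice of file1/file2; "cacccg" is an HP entry in file3.
for line in mark("IRHPV", "attcgtcacccggtt", {"cacccg"}):
    print(line)
```

A set lookup on the six-letter string does the same job as the two codon indices in the C program; the C table is likely faster, but the set is easier to follow.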

former_member10945 · ‎03-17-2008

As I've read through this I am wondering what you are trying to evaluate. I am sure I can satisfy your requirements with just about any language on the planet --- it will just be as ugly as the C code in your example. Are you looking for the scripting language that makes it the most readable? The fastest (does it have to be interpreted or can it be compiled)? The easiest to extend, etc. etc.?

Language choice is all about using the right tool for the job --- I am missing what else you need it to do besides just work.

If this is simply a mental exercise in how many languages the algorithm can be built in.... then it's not very difficult to answer --- any of them will do fine.

-d

former_member181923 · ‎03-17-2008

Well, if LISP has gotten in here, then we might as well consider MONK.

MONK is/was an MIT knock-off of LISP that SeeBeyond (now Sun) used to use to build ETD's, before they were forced to go to JAVA.

So I guess you could say that MONK was SeeBeyond's ABAP.

heh heh heh

former_member10945 · ‎03-17-2008

Although I am woefully inept at this class of languages, most tail-recursive functional languages would be best at solving this problem. This is really just a parsing problem like any other in the CS world. The most efficient languages in terms of expression for parsers are almost always functional languages. You can make Python look like a functional language, but that isn't the way it's speediest. I might try and bust out my Scheme book this weekend and have a crack at this.

Functional Languages of Note:

Scheme, Lisp, Erlang.

-d

former_member583013 · ‎03-17-2008

David:

I already know C and C++....So it's not that hard for me to understand the code...Craig already offered to make a PHP version....So if you want I can try to make a Ruby version -;) Ok...I want to do it anyway...So just tell me if you want me to send the code to you -:D

Greetings,

Blag.

Former Member · ‎03-14-2008

The question should not be whether they can do it or not (they can); it's a matter of how efficient they would be or which would be faster.

Trying to wrap my head around how I would do it in Perl (been a while). If you've got the code, I'll do a PHP version if you like, for comparison's sake. PHP, like Perl, can run browser-based or command-line-based. I have a feeling Python might run quicker but that's just a hunch.
