on 03-14-2008 4:57 PM
I've been told that perl is a "scripting" language like the other languages mentioned in this forum.
If that's true, can these other languages handle the following spec as well as perl can? (See spec at end of this post.)
Or is perl stronger in string operations than the other scripting languages mentioned here?
Here's the spec:
1. I give your program a twenty-letter alphabet (any twenty letter alphabet)
For example:
ABCDEFGHIJKLMNOPQRST
2. I also give your program four groups (any four groups) of letters in this alphabet:
For example:
s: A,B,C,D,E
p: F,G,H,I,J
d: K,L,M,N,O
e: P,Q,R,S,T
3. I also give your program a sequence over the twenty-letter alphabet that I gave you in Step (1) above:
For example:
ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA
4. Given this sequence, you search for pairs of adjacent letters (x,y) where x and y are from different groups (the groups defined in Step (2) above).
Also, you return the results of this search by giving me back the following two strings:
ABCD(EF)GHI(JK)LMN(OP)QRSTSRQ(PO)NML(KJ)IHG(FE)DCBA
ABCD(sp)GHI(pd)LMN(de)QRSTSRQ(ed)NML(dp)IHG(ps)DCBA
5. Note: if I give you a sequence that contains "overlapping" ordered pairs like:
...EFK...
then you ignore the second ordered pair. That is, you return:
...(EF)K
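For reference, the spec above can be read as a single left-to-right scan. The following is only a minimal Python sketch under that reading; the function name `mark_pairs` and the dictionary layout for the groups are illustrative assumptions, not part of the original spec.

```python
def mark_pairs(sequence, groups):
    """Return (letters_marked, groups_marked) for one left-to-right scan.

    `groups` maps a group name (e.g. "s") to the letters in that group.
    """
    # Reverse lookup: letter -> name of the group it belongs to.
    group_of = {letter: name for name, letters in groups.items() for letter in letters}

    by_letter, by_group = [], []
    i = 0
    while i < len(sequence):
        x = sequence[i]
        y = sequence[i + 1] if i + 1 < len(sequence) else None
        gx, gy = group_of.get(x), group_of.get(y)
        if gx is not None and gy is not None and gx != gy:
            by_letter.append("(" + x + y + ")")
            by_group.append("(" + gx + gy + ")")
            i += 2  # skip y, so an overlapping pair (y, z) is ignored (Step 5)
        else:
            by_letter.append(x)
            by_group.append(x)
            i += 1
    return "".join(by_letter), "".join(by_group)


groups = {"s": "ABCDE", "p": "FGHIJ", "d": "KLMNO", "e": "PQRST"}
letters, codes = mark_pairs("ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA", groups)
print(letters)
print(codes)
```

Advancing by two positions after a match is what implements Step 5: in ...EFK... the scan resumes at K, so (FK) is never considered.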
At Craig's suggestion, I've set up a WIKI page for this problem here:
https://wiki.sdn.sap.com/wiki/display/EmTech/Bio-InformaticCodingProblem+1
Contributors to this thread should feel free to post their solutions as child-pages of the above page.
Contributors of multiple solutions in different languages should post each solution on a different child-page.
I will post Bill's perl and Gunter's C as solutions to Problem 2, since their programs do more than what was asked for in the original spec given in the top-post of this thread.
Hi Craig -
Thanks for posting those files.
I just want to clarify that the second and third input files are relevant to the larger programs that Bill and Gunter wrote in perl and C.
I will be explaining these programs in "Problem 2" in the EmergTech-Bioinformatic WIKI, but folks here can probably figure out how they operate on the three input files just by looking at Gunter and Bill's code.
just wanted to add the JAVASCRIPT version, but this forum doesn't let me post it. "METHOD NOT IMPLEMENTED" it says. hmmm. no idea what that means.
anton
very odd, email me the script and I'll have them double check on our dev system what is what.
Once David creates the wiki area we should be able to post there as well, although everyone could post their code samples in the Snippets wiki (https://wiki.sdn.sap.com/wiki/display/Snippets) now and label them with a tag specific to this topic.
Java version!
/**
 *
 * @author Gregor Brett
 *
 */
public class djh
{
    public static void main(String[] args)
    {
        String a = "ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA";
        String[] s = {"A","B","C","D","E"};
        String[] p = {"F","G","H","I","J"};
        String[] d = {"K","L","M","N","O"};
        String[] e = {"P","Q","R","S","T"};
        String[] names = {"s","p","d","e"};
        String[][] groups = {s,p,d,e};
        // One slot per ordered cross-group letter pair (assumes equal-sized groups).
        String[] pairs = new String[(groups.length)*((groups[0].length*groups[0].length)*(groups.length-1))];
        String[] pairs_codes = new String[pairs.length];
        int count = 0;
        for(int i=0;i<groups.length;i++)
        {
            for(int n=0;n<groups[i].length;n++)
            {
                for(int m=0;m<groups.length;m++)
                {
                    if(i != m)
                    {
                        for(int l=0;l<groups[m].length;l++)
                        {
                            pairs[count] = groups[i][n] + groups[m][l];
                            pairs_codes[count] = names[i] + names[m];
                            count++;
                        }
                    }
                }
            }
        }
        String ai = a;
        for(int i=0;i<pairs.length;i++)
        {
            // Only the first occurrence of each pair is parenthesized.
            a = a.replaceFirst(pairs[i], "("+pairs[i]+")");
            ai = ai.replaceFirst(pairs[i], "("+ pairs_codes[i] +")");
        }
        System.out.println(a);
        System.out.println(ai);
    }
}
Python version!
##############################
# @author: Gregor Brett      #
##############################
import re

a = "ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA"
s = ["A","B","C","D","E"]
p = ["F","G","H","I","J"]
d = ["K","L","M","N","O"]
e = ["P","Q","R","S","T"]
names = ["s","p","d","e"]
groups = [s,p,d,e]
# One slot per ordered cross-group letter pair (assumes equal-sized groups).
pairs = [""]*(len(groups)*((len(s)**2) *(len(groups)-1)))
pairs_codes = [""]*len(pairs)
c = 0
for i in range(0, len(groups)):
    for n in range(0, len(groups[i])):
        for m in range(0, len(groups)):
            if i != m:
                for l in range(0, len(groups[m])):
                    pairs[c] = groups[i][n] + groups[m][l]
                    pairs_codes[c] = names[i] + names[m]
                    c = c + 1
ai = a
# k instead of p, so the loop index does not shadow the group list p
for k in range(0, len(pairs)):
    regex = re.compile(pairs[k])
    # count=1 on both strings keeps them in step (first occurrence only)
    a = regex.sub('('+pairs[k]+')', a, count=1)
    ai = regex.sub('('+pairs_codes[k]+')', ai, count=1)
print(a)
print(ai)
I just sent Craig the files (see copy of email below.)
Note to Anton: heh heh heh ... I like your style!
Email to Craig:
Craig -
In case anyone wants to try the substring routine and the parenthesization of the nucleotide string, the attached zipfile has:
1) Gunter's c code: 20let.c;
2) Gunter's c exe: 20let.exe
3) Bill's latest perl: 20let-re.pl
4) three input files:
val1a.txt
val2a.txt
val3b.txt
5) output file: fileout.txt
On to the WIKI-page!!!!
Thanks very much again. You're being very kind.
Best
djh
here's a solution using the exact same algorithm as the one i posted earlier. only the language used this time is a bit 'chattier'
(or my programming skill in this language is).
looking forward to seeing if anyone guesses the language used.
*&---------------------------------------------------------------------*
*& Report ZTW_REGEX1
*&
*&---------------------------------------------------------------------*
*& created for djh challenge; ACW210308
*&
*&---------------------------------------------------------------------*
REPORT ZTW_REGEX1.
data: l_target type string,
l_targes type string,
lt_group type table of string,
ll_group type string,
ll_grouq type string,
lt_groupid type table of string,
ll_groupid type string,
ll_grouqid type string,
l_ind type i,
l_ine type i,
l_pattern type string,
l_replace type string.
l_target = 'ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA'.
l_targes = 'ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA'.
append 'ABCDE' to lt_group. append 's' to lt_groupid.
append 'FGHIJ' to lt_group. append 'p' to lt_groupid.
append 'KLMNO' to lt_group. append 'd' to lt_groupid.
append 'PQRST' to lt_group. append 'e' to lt_groupid.
loop at lt_group into ll_group.
l_ind = sy-tabix.
loop at lt_group into ll_grouq.
l_ine = sy-tabix.
if l_ind <> l_ine.
read table lt_group index l_ind into ll_group.
read table lt_group index l_ine into ll_grouq.
read table lt_groupid index l_ind into ll_groupid.
read table lt_groupid index l_ine into ll_grouqid.
concatenate '([' ll_group '][' ll_grouq '])' into l_pattern.
replace regex l_pattern in l_target with '($1)'.
concatenate '(' ll_groupid ll_grouqid ')' into l_replace.
replace regex l_pattern in l_targes with l_replace.
endif.
endloop.
endloop.
write: / l_target.
write: / l_targes.
anton
In Bill's perl program posted above, he put a comment in the "substrings" routine to indicate that he had a faster version in mind.
Here is his "regex" recoding of the "substrings" routine. He's pretty sure it will run faster than the original.
sub substrings {
my ($x1, $x2) = @_;
printf("\nlengths %d - %d : \n", $x1, $x2);
pos($C) = 0;
while ($C =~ /[spdt]/g) {
pos($C)-1+$x1 < length $C or last;
my $j = substr($C, pos($C)-1+$x1, $x2-$x1);
while ($j =~ /[spdt]/g) {
my $a = my $c = substr($C, pos($C)-1, pos($j)+$x1);
my $b = substr($B, pos($C)-1, pos($j)+$x1);
$b =~ tr/()//d;
$c =~ tr/a-z//cd;
print "$a|$b|$c\n";
}
}
}
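For anyone following along without perl, here is a rough Python sketch of what the "substrings" routine appears to do, based on my reading of the loop-based version in Bill's program rather than on any statement from Bill. The function name, the parameter names, and the choice to return a list instead of printing are assumptions made for illustration.

```python
def substrings(B, C, x1, x2):
    """Collect 'C-span|B-span without parens|group codes' records.

    B is the letter-marked line and C the group-marked line; both are
    assumed to have the same length with parentheses in the same places.
    """
    def is_code(ch):
        # Group codes are the lowercase letters in the C line.
        return "a" <= ch <= "z"

    records = []
    for i in range(len(C)):
        if not is_code(C[i]):
            continue  # a span may only start on a group code
        for j in range(i + x1, min(i + x2, len(C))):
            if not is_code(C[j]):
                continue  # and may only end on a group code
            span_c = C[i:j + 1]
            span_b = B[i:j + 1].replace("(", "").replace(")", "")
            codes = "".join(ch for ch in span_c if is_code(ch))
            records.append(span_c + "|" + span_b + "|" + codes)
    return records


for line in substrings("X(AB)YY(CD)Z", "X(sp)YY(de)Z", 3, 7):
    print(line)
```

The toy input is made up; with the real data the two arguments would be the first two output lines of the program.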
a quick PHP solution:
<?
// djh challenge
$target = $targes = 'ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA';
$group[1] = 'ABCDE'; $groupid[1] = "s";
$group[2] = 'FGHIJ'; $groupid[2] = "p";
$group[3] = 'KLMNO'; $groupid[3] = "d";
$group[4] = 'PQRST'; $groupid[4] = "e";
for($i=1; $i < count($group)+1; $i++) {
for($j=1; $j < count($group)+1; $j++) {
if ($i <> $j) {
$target = preg_replace('/(['.$group[$i].']['.$group[$j].'])/', '($1)', $target);
$targes = preg_replace('/(['.$group[$i].']['.$group[$j].'])/', '('.$groupid[$i].$groupid[$j].')', $targes);
}
}
}
echo $target. "\n" . $targes;
//fulfills 5. & yields
//ABCD(EF)GHI(JK)LMN(OP)QRSTSRQ(PO)NML(KJ)IHG(FE)DCBA
//QED.
?>
regards, anton
Anton -
Thanks for the contribution - it makes the thread that much more interesting, and it was interesting already.
If we can get a Python example and a LISP or SCHEME example, I'm going to suggest that we all try to do
a "cross-walk" of all the different programs, with clear explanations of exactly how construct A in program X does the
job of construct B in program Y, construct C in program Z, etc.
Sometimes, it's easier to learn one language by learning several at the same time.
Best regards
djh
Craig -
At the risk of ticking you off, I'm going to make a more "comprehensive"
suggestion regarding file-sharing.
In 2005, I took my own web-server off line from the commercial ISP where
it was housed. (It was costing me about $400 per month to keep it running
and accessible with reasonable response time.)
Although it is an old SUN RaQ500 "appliance server" that is no longer
officially supported by Sun, it is a perfectly serviceable machine that
still runs very nicely under Linux with several PHP bulletin boards fully
configured on it.
Also, I know of two people that would probably "admin" it for free.
So how about if I crate it up and send it to you to be mounted somewhere
in SAP land, with a private IP that would be given to those in this
little collaboration group that seems to be starting up nicely here.
If you/SAP were willing to do this, then it would be very easy for me to
continue providing interesting scripting problems that would teach anyone
interested a lot about bioinformatics.
Plus, the bulletin board capability would be more convenient and take some
load off this forum (we could post back here when we have something of
more-than-usual interest to report.)
Do I have a hidden agenda here?
Absolutely.
As I said many times before, I'd like to drag SAP kicking and screaming into
the world of bioinformatics because eventually, someone somewhere is going to
realize that the way to do bioinformatics is to structure it as an SCM problem
(what makes what, where do you get it from, what else uses it, etc. etc. etc.??)
And that someone somewhere might as well be SAP rather than Oracle.
So what I'm suggesting would be an absolutely free way to set up some
infrastructure that would allow forward motion in this direction,
depending, of course, on the willingness of folks here like Ethan and Alvaro
etc. to continue contributing code.
See? I told you I was going to tick you off.
Sorry! I figured it was worth a shot.
Uh - NO.
There's no reason for it. We have an entire Wiki Code Gallery, an entire Wiki for collaborating, and the forums already; you offer us nothing new. We've had this discussion in the past about using the Wiki (you've still not taken advantage of that).
So if you want to send me the data files, I'll attach them to the forum; if not, then fine. But we're certainly not going to host a server just for this when we have all the capabilities already.
So if you want to get the community more interested, then do your part and put the info into the existing tools. If the community is interested enough, then SAP will begin to notice; otherwise you're not going to effect much change.
Here are some comments from Bill on the situation as he sees it:
I found the ruby code and ran it on my Linux box. It takes no arguments and prints out:
mannb:dh $ ruby 20let.ruby
ABCD(EF)GHI(JK)LMN(OP)QRSTSRQ(PO)NML(KJ)IHG(FE)DCBA
Since I don't know ruby, and it doesn't do the same things, it's hard to evaluate. Certainly I'd need to look at a ruby manual to understand it. The regular expression is used to locate the stuff in (), but as Ethan said, it doesn't try to find the substrings, at least not yet.
I translated the C code to perl in 2-3 hours. I didn't try to figure out the best possible algorithm, and I tried to make the programs parallel so they would be easy to compare. The C program is not optimized for speed, space, or style. My perl program is fairly simple if you can read Perl regular expressions, and understand the options of tr/// I'm using.
Python uses what is basically the same regular expression subroutine package as perl. Ruby seems different.
We used perl because I like it and it was flexible and speedy enough for most things. If not, C's my choice.
David,
Well, it kept niggling, so over lunch I updated it to be a little prettier, tested and fixed it with multiple characters, and added an example of how to pass it arguments. (Edited to add that it also returns both strings requested.) I think I'm satisfied with it now.
def david_halitsky_challenge(seq = "ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA",
groups = { 's' => %w(A B C D E),
'p' => %w(F G H I J),
'd' => %w(K L M N O),
'e' => %w(P Q R S T) })
# Arrays of letters *not* in each group
gls = groups.keys
notgroups = { gls[0] => groups[gls[1]]+groups[gls[2]]+groups[gls[3]],
gls[1] => groups[gls[0]]+groups[gls[2]]+groups[gls[3]],
gls[2] => groups[gls[1]]+groups[gls[0]]+groups[gls[3]],
gls[3] => groups[gls[1]]+groups[gls[2]]+groups[gls[0]] }
# One regex per group
regexes = Hash.new()
# Generate regex string of form ((K|L|M|N|O)(F|G|H|I|J|A|B|C|D|E|P|Q|R|S|T))
groups.keys.each do |g|
regexes[g] = '((' + groups[g].join('|') + ')(' + notgroups[g].join('|') + '))'
end
# Build the full regex.
big_ol_regex = ''
regexes.keys.each do |r|
big_ol_regex += regexes[r].to_s.reverse.chomp(r.to_s.reverse).reverse + '|'
end
big_ol_regex.chomp!('|')
big_ol_regex += '+?'
# Substitute using the first backward match to return the first result.
puts seq.gsub(Regexp.new(big_ol_regex), '(\0)')
# Replace letters with group names to build the second result.
seq_replaced = seq.gsub(Regexp.new(big_ol_regex)) do |s|
groups.keys.each do |k|
s.gsub!(Regexp.new('(' + groups[k].join('|') + ')+?'), k.to_s)
end
'(' + s + ')'
end
puts seq_replaced
end
# Method david_halitsky_challenge expects a sequence of "letters" and a hash
# with 4 keys, each pointing to an array of "letters" found in the sequence. These
# arrays are the "groups". The method assumes that the groups do not overlap
# either at the full-letter level or at the sub-letter level for multi-character letters.
#
# david_halitsky_challenge("ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA",
# { 's' => %w(A B C D E),
# 'p' => %w(F G H I J),
# 'd' => %w(K L M N O),
# 'e' => %w(P Q R S T) })
david_halitsky_challenge()
# Multi-character "letter" test.
david_halitsky_challenge("zAzBzCzDzEzFzGzHzIzJzKzLzMzNzOzPzQzRzSzTzSzRzQzPzOzNzMzLzKzJzIzHzGzFzEzDzCzBzA",
{ 'Ys' => %w(zA zB zC zD zE),
'Yp' => %w(zF zG zH zI zJ),
'Yd' => %w(zK zL zM zN zO),
'Ye' => %w(zP zQ zR zS zT) })
Edited by: Ethan Jewett on Mar 19, 2008 8:27 PM
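Since a Python example keeps coming up in this thread, here is a hedged Python sketch of the same "one big regex" idea: one alternative per group, each matching a letter of that group followed by a letter of any other group. It is not a translation of the Ruby above. It handles single-character letters only, assumes at least two non-empty groups, and the name `mark_pairs_regex` is made up for illustration.

```python
import re


def mark_pairs_regex(sequence, groups):
    """Mark cross-group pairs with one alternation regex (single-character letters only)."""
    # Reverse lookup: letter -> group name.
    group_of = {ch: name for name, letters in groups.items() for ch in letters}

    # One alternative per group: [letters of this group][letters of all other groups]
    alternatives = []
    for name, letters in groups.items():
        others = "".join(ls for n, ls in groups.items() if n != name)
        alternatives.append("[" + re.escape(letters) + "][" + re.escape(others) + "]")
    pattern = re.compile("|".join(alternatives))

    def as_letters(match):
        return "(" + match.group(0) + ")"

    def as_codes(match):
        x, y = match.group(0)  # the match is always exactly two letters
        return "(" + group_of[x] + group_of[y] + ")"

    # re.sub scans left to right and never reuses a matched letter,
    # so an overlapping second pair is skipped automatically.
    return pattern.sub(as_letters, sequence), pattern.sub(as_codes, sequence)


groups = {"s": "ABCDE", "p": "FGHIJ", "d": "KLMNO", "e": "PQRST"}
for line in mark_pairs_regex("ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA", groups):
    print(line)
```

Using a function as the replacement argument lets the same compiled pattern produce both output strings without building a table of every possible pair.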
Hey Ethan -
Glad you "couldn't resist".
No matter what the technical merits of the regex approach are or are not, I gotta say you've got style and class. "Big_ol_regex" indeed!
I've asked Bill Mann to comment on your code, and if he has the time, I'll post what he has to say.
And I'm sure some of the more savvy scriptors around here will have something to say about it as well.
Thanks again for posting your approach.
Dave
Well David, I couldn't pass this one up, and so my inaugural post in the forums is a snippet of Ruby code.
This is set up to run your initial example, and doesn't return the second string you ask for, but I'm out of time. I think it'll probably work on the amino acid examples (multi-character 'letters') that you give, but I haven't tested it. It takes a sequence in the alphabet, an array of group names, and four groups of letters in a hash.
Most importantly, it uses a regular expression to do the dirty work. I felt some obligation to back up all my Twitter talk over the last couple of days and though I'm no expert, this is what I came up with.
Enjoy.
def david_halitsky_challenge(seq = "ABCDEFGHIJKLMNOPQRSTSRQPONMLKJIHGFEDCBA",
gls = %w(s p d e),
groups = { 's' => %w(A B C D E),
'p' => %w(F G H I J),
'd' => %w(K L M N O),
'e' => %w(P Q R S T) })
# Arrays of letters *not* in each group
notgroups = { gls[0] => groups[gls[1]]+groups[gls[2]]+groups[gls[3]],
gls[1] => groups[gls[0]]+groups[gls[2]]+groups[gls[3]],
gls[2] => groups[gls[1]]+groups[gls[0]]+groups[gls[3]],
gls[3] => groups[gls[1]]+groups[gls[2]]+groups[gls[0]] }
# One regex per group
regexes = Hash.new()
# Generate regex string of form d((K|L|M|N|O)(F|G|H|I|J|A|B|C|D|E|P|Q|R|S|T))
gls.each do |g|
regexes[g] = '(('
groups[g].each do |l|
regexes[g] += l + '|'
end
regexes[g].chomp!('|')
regexes[g] += ')('
notgroups[g].each do |l|
regexes[g] += l + '|'
end
regexes[g].chomp!('|')
regexes[g] += '))'
end
# Build the full regex.
big_ol_regex = ''
regexes.each do |r|
big_ol_regex += r.to_s.reverse.chop.reverse + '|'
end
big_ol_regex.chomp!('|')
big_ol_regex += '+?'
# Substitute using the first backward match.
puts seq.gsub(Regexp.new(big_ol_regex), '(\0)')
end
david_halitsky_challenge()
Edited by: Ethan Jewett on Mar 19, 2008 2:22 AM
Dan/Craig -
Yeah - Alvaro pretty much has it correct: speed and convenience. But not convenience in the usual sense.
Let's look at speed first.
Imagine you had to do the algorithm on hundreds of millions of strings, many much longer than the strings in inputs 1 and 2. That's where the speed comes in.
But here's where the "convenience" comes in (in my sense of the word "convenience"). Suppose that in terms of speed, it looks like this (from fastest to slowest):
C/C++
perl
PHP
But suppose your best algorithm-creator actually thinks best "in PHP".
Then for that person, it's more "convenient" to frame a solution in PHP and then let others "translate" that solution into other languages that generate faster runtimes.
Of course, there are some who would say it's better to "think-up" the algorithm in language-neutral terms. But in my experience, that's not really the way people work.
Because the nature of each language interacts with the algorithm-creation process in very subtle ways.
Here's a perl program that Bill Mann wrote to do the same thing as the C program in the last post.
He says it may not be the fastest possible perl (but that's just him being modest!).
#!/usr/bin/perl -w
# perl version of 20let.c5
sub usage {
print("\nusage:20let protein-file nucleotide-file pairs-include-file\n\n");
print("marks amino-acid-pairs from different groups in protein-file\n");
print("iff they are in the include-file\n");
}
{
@ARGV != 3 and &usage, exit 1;
#----------------define the groups
for (qw(I M V A G)) {
$G{$_} = 's';
}
for (qw(F L P W)) {
$G{$_} = 'p';
}
for (qw(H Q D E)) {
$G{$_} = 'd';
}
for (qw(S T Y N C K R)) {
$G{$_} = 't';
}
#----------------the 4 bases
$bases = '[acgt]';
#---------------- read include-file (3rd argument)
open(I, "<$ARGV[2]") or die "can't open include-file $ARGV[2]\n";
while (defined($_ = <I>)) {
/^(..) ($bases{6,6})$/io and $E{$2} = $1;
}
#------------------read amino-acid file (1st argument)
open(I, "<$ARGV[0]") or die "can't open amino-acid file $ARGV[0]\n";
while (defined($_ = <I>)) {
tr/IMVAGFLPWHQDESTYNCKR//cd; # delete anything else
$P .= $_;
}
$p = length $P;
#------------------read nucleotide file (2nd argument)
open(I, "<$ARGV[1]") or die "can't open nucleotide file $ARGV[1]\n";
while (defined($_ = <I>)) {
tr/acgt//cd; # delete anything else
$N .= $_;
}
length $N >= $p * 3 or
die "amino-acid file doesn't match nucleotide file\n";
#------------three output lines------------------
for ($i=0; $i < $p; ++$i) {
if ($E{substr($N, $i*3, 6)}) {
my $a = $G{substr($P, $i, 1)};
my $b = $G{substr($P, $i+1, 1)};
if (!defined $a || !defined $b) {
1;
}
if ($a ne $b)
{
$B .= '(' . substr($P, $i, 2) . ')';
$C .= '(' . $G{substr($P, $i, 1)} . $G{substr($P, $i+1, 1)} . ')';
$D .= '(' . substr($N, $i*3, 6) . ')';
++$i;
next;
}}
$B .= substr($P, $i, 1);
$C .= substr($P, $i, 1);
$D .= substr($N, $i*3, 3);
}
print $B, "\n";
print $C, "\n";
print $D, "\n";
#--------------substrings------------
substrings(20,29);
substrings(30,39);
substrings(40,49);
substrings(50,59);
substrings(60,69);
}
sub substrings {
my ($x1, $x2) = @_;
printf("\nlengths %d - %d : \n", $x1, $x2);
for ($i = 0; $i < length $C; ++$i) { # using m// and pos() might be faster
substr($C, $i, 1) =~ /[a-z]/o or next;
for ($j = $i+$x1; $j < $i+$x2 && $j < length $C; ++$j) {
substr($C, $j, 1) =~ /[a-z]/o or next;
my $a = my $c = substr($C, $i, $j-$i+1);
my $b = substr($B, $i, $j-$i+1);
$b =~ tr/()//d;
$c =~ tr/a-z//cd;
print "$a|$b|$c\n";
}
}
}
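To make the main loop of the perl easier to compare with the other languages in this thread, here is a small Python sketch of the "three output lines" step as I read it: a pair of adjacent amino acids is parenthesized only when its six-nucleotide string is in the include list and the two amino acids belong to different groups. The function name, the use of a plain set for the include list, and the tiny test input are assumptions for illustration; file reading and the substrings step are left out.

```python
# Amino-acid groups as defined in the perl program.
GROUPS = {"s": "IMVAG", "p": "FLPW", "d": "HQDE", "t": "STYNCKR"}


def mark_with_codons(protein, nucleotides, include):
    """Return the three marked lines (letters, group codes, nucleotides).

    `include` is a set of six-letter nucleotide strings taken from the
    include-file; `nucleotides` holds three bases per amino acid.
    """
    group_of = {aa: name for name, letters in GROUPS.items() for aa in letters}
    B, C, D = [], [], []
    i = 0
    while i < len(protein):
        six = nucleotides[3 * i:3 * i + 6]
        a = group_of.get(protein[i])
        b = group_of.get(protein[i + 1]) if i + 1 < len(protein) else None
        if six in include and a is not None and b is not None and a != b:
            B.append("(" + protein[i:i + 2] + ")")
            C.append("(" + a + b + ")")
            D.append("(" + six + ")")
            i += 2  # the second amino acid of the pair is consumed
        else:
            B.append(protein[i])
            C.append(protein[i])
            D.append(nucleotides[3 * i:3 * i + 3])
            i += 1
    return "".join(B), "".join(C), "".join(D)


# Hypothetical five-residue example: only the HP pair (cacccg) is included.
for line in mark_with_codons("MKHPV", "atgaaacacccggtt", {"cacccg"}):
    print(line)
```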
OK - here is the final stuff on the "C" side.
To execute the program, the command line is:
20let.exe file1.txt file2.txt file3.txt > fileout.txt
Below, I've provided:
a) source code 20let.c
b) sample input file1.txt
c) sample input file2.txt
d) sample input file3.txt
e) output fileout.txt generated from these input files.
As soon as Bill finishes the perl version of the source code, I'll post that also.
***************
source code of 20let.c
***************
// 20let.c5
#include <stdio.h>
#include <stdlib.h>
int T[333],A[99999],G[333],B[99999],C[99999],N[299999],P[99999];
int n1,n2,f,p,x1,x2,n,m,a,b,c,i,j,k,x,y,z;
int E[233][233];
FILE *file;
int substrings(int x1,int x2);
int main(int argc, char*argv[]) {
if(argc<4){ // program name plus three file arguments
printf("\nusage:20let protein-file nucleotide-file pairs-include-file\n\n");
printf("marks amino-acid-pairs from different groups in protein-file\n");
printf("iff they are in the include-file\n");
exit(1);
}
//----------------define the groups G['I'] = 's', e.g.
x='s'; G['I']=x;G['M']=x;G['V']=x;G['A']=x;G['G']=x;
x='p'; G['F']=x;G['L']=x;G['P']=x;G['W']=x;G['W']=x;
x='d'; G['H']=x;G['Q']=x;G['D']=x;G['E']=x;G['E']=x;
x='t'; G['S']=x;G['T']=x;G['Y']=x;G['N']=x;G['C']=x;G['K']=x;G['R']=x;
//----------------the 4 bases T['a'] = 0 thru 3
for(x=0;x<222;x++)
T[x]=-999;
T['a']=0;T['c']=1;T['g']=2;T['t']=3;
T['A']=0;T['C']=1;T['G']=2;T['T']=3;
/*
for(i=65;i<70;i++)G[i]='s';
for(i=70;i<75;i++)G[i]='p';
for(i=75;i<80;i++)G[i]='d';
for(i=80;i<85;i++)G[i]='t';
*/
//---------------- read include-file file3 xxxyyy pairs E[x][y] of interest
f=0;
for(x=0;x<222;x++)
for(y=0;y<222;y++)
E[x][y]=0;
if((file=fopen(argv[3],"rb"))==NULL){
printf("\ncan't open include-file %s\n",argv[3]);exit(1);
}
mq1: if(feof(file))
goto mq3;
x=fgetc(file);y=fgetc(file);x=fgetc(file);
x=T[fgetc(file)]*16+T[fgetc(file)]*4+T[fgetc(file)];
y=T[fgetc(file)]*16+T[fgetc(file)]*4+T[fgetc(file)];
if(x<64 && x>=0 && y<64 && y>=0){
E[x][y]=1;
f++;
}
mq2: if(feof(file))
goto mq3;
a=fgetc(file);
if(a!=10)
goto mq2;
goto mq1;
mq3: fclose(file);
//------------------read amino-acid file file1 == P array
if((file=fopen(argv[1],"rb"))==NULL){
printf("\ncan't open file %s\n",argv[1]);exit(1);}
p=0;
m1p: if(feof(file))
goto m2p;
p++;
P[p]=fgetc(file);
if(G[P[p]]==0)
p--;
goto m1p;
m2p:;
fclose(file);
//------------------read nucleotide file file2 == N array
if((file=fopen(argv[2],"rb"))==NULL){
printf("\ncan't open file %s\n",argv[2]);exit(1);
}
n=0;
m1n: if(feof(file))
goto m2n;
n++;
N[n]=fgetc(file);
if(N[n]!='a' && N[n]!='c' && N[n]!='g' && N[n]!='t')
n--;
goto m1n;
m2n:;
fclose(file);
//for(i=1;i<=p;i++)printf("%c",P[i]);printf("\n");
//for(i=1;i<=n;i++)printf("%c",N[i]);printf("\n");
//printf("%i include-pairs %i nucleotides %i proteins\n",f,n,p);
//------------1st line------------------ B[i] = result
m=0;
for(i=1;i<=p;i++){
n1=T[N[i*3-2]]*16+T[N[i*3-1]]*4+T[N[i*3]];
n2=T[N[i*3+1]]*16+T[N[i*3+2]]*4+T[N[i*3+3]];
//printf("\ni=%i p=%i n1=%i n2=%i\n",i,p,n1,n2);
if(E[n1][n2]<1 || G[P[i]]==G[P[i+1]] /* || i==n */){
printf("%c",P[i]);
m++;
B[m]=P[i];
goto m3;
}
printf("(%c%c)",P[i],P[i+1]);
i++;
m++;
B[m]='(';
m++;
B[m]=P[i-1];
m++;
B[m]=P[i];
m++;
B[m]=')';
//printf("(%c)%c",G[A[i]],G[A[i+1]]);i++;
m3:;
}
printf("\n");
//------------2nd line------------------ C[i] = result
m=0;
for(i=1;i<=p;i++){
n1=T[N[i*3-2]]*16+T[N[i*3-1]]*4+T[N[i*3]];
n2=T[N[i*3+1]]*16+T[N[i*3+2]]*4+T[N[i*3+3]];
if(E[n1][n2]<1 || G[P[i]]==G[P[i+1]] /* || i==n */){
printf("%c",P[i]);
m++;
C[m]=P[i];
goto m4;
}
printf("(%c%c)",G[P[i]],G[P[i+1]]);
i++;
m++;
C[m]='(';
m++;
C[m]=G[P[i-1]];
m++;
C[m]=G[P[i]];
m++;
C[m]=')';
//printf("(%c)%c",G[A[i]],G[A[i+1]]);i++;
m4:;
}
printf("\n");
//for(i=1;i<=m;i++)printf("%c",B[i]);printf("\n");
//------------3rd line------------------ printf only
m=0;
for(i=1;i<=p;i++){
n1=T[N[i*3-2]]*16+T[N[i*3-1]]*4+T[N[i*3]];
n2=T[N[i*3+1]]*16+T[N[i*3+2]]*4+T[N[i*3+3]];
if(E[n1][n2]<1 || G[P[i]]==G[P[i+1]] /* || i==n */){
printf("%c%c%c",N[i*3-2],N[i*3-1],N[i*3]);
goto m33;
}
printf("(%c%c%c%c%c%c)",N[i*3-2],N[i*3-1],N[i*3],N[i*3+1],N[i*3+2],N[i*3+3]);
i++;
m33:;
}
printf("\n");
//--------------substrings------------
substrings(20,29);
substrings(30,39);
substrings(40,49);
substrings(50,59);
substrings(60,69);
return 0;
}
int substrings(int x1,int x2)
{
printf("\n");
printf("lengths %i - %i : \n",x1,x2);
for(i=1; i<p; i++)
for (j=i+x1; j<i+x2; j++) {
if (C[i]>95 && C[j]>95) { // if lc letter in line2
for(x=i;x<=j;x++)
printf("%c",C[x]);
printf("|");
for(x=i;x<=j;x++)
if(B[x]>44) // if not () in line 1
printf("%c",B[x]);
printf("|");
for(x=i;x<=j;x++)
if(C[x]>95) // if lc letter line2
printf("%c",C[x]);
printf("\n");}
}
}
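One detail of the C source above that is easy to miss: each three-base codon is packed into a number from 0 to 63 (first base times 16, plus second base times 4, plus third base, with a=0, c=1, g=2, t=3), and a pair of codons then indexes the E[x][y] table. A small Python sketch of that encoding follows; the function names are illustrative and not taken from the C code, and None stands in for the C code's negative "invalid" value.

```python
BASE_VALUE = {"a": 0, "c": 1, "g": 2, "t": 3}


def codon_index(codon):
    """Pack a three-base codon into an integer 0..63, or return None if invalid."""
    if len(codon) != 3:
        return None
    value = 0
    for base in codon.lower():
        if base not in BASE_VALUE:
            return None
        value = value * 4 + BASE_VALUE[base]  # base-4 positional encoding
    return value


def pair_key(six):
    """Split a six-base string into its two codon indices (x, y)."""
    return codon_index(six[:3]), codon_index(six[3:])


print(codon_index("atg"))
print(pair_key("cacccg"))
```

With 64 possible values per codon, a 64 by 64 table is enough to hold every include-pair, which is why the lookup in the main loops is a plain array access.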
******************
input file1.txt
******************
MKKHTDQPIADVQGSPDTRH
IAIDRVGIKAIRHPVLVADK
DGGSQHTVAQFNMYVNLPHN
FKGTHMSRFVEILNSHEREI
SVESFEEILRSMVSRLESDS
GHIEMTFPYFVNKSAPISGV
KSLLDYEVTFIGEIKHGDQY
GFTMKVIVPVTSLCPCSKKI
SDYGAHNQRSHVTISVHTNS
FVWIEDVIRIAEEQASCELF
GLLKRPDEKYVTEKAYNNPK
FVEDIVRDVAEILNHDDRID
AYVVESEBFESIHNHSAYAL
IERD
***********************
input file2.txt
**********************
atgaaaaaacatactgatcaacctatcgctgatgtgcagggctcaccggataccagacat
atcgcaattgacagagtcggaatcaaagcgattcgtcacccggttctggtcgccgataag
gatggtggttcccagcataccgtggcgcaatttaatatgtacgtcaatctgccacataat
ttcaaagggacgcatatgtcccgttttgtggagatactaaatagccacgaacgtgaaatt
tcggttgaatcatttgaagaaattttgcgctccatggtcagcaggctggaatcagattcc
ggccatattgaaatgacttttccctacttcgtcaataaatcagcccctatctcaggtgta
aaaagcttgctggattatgaggtaacctttatcggcgaaattaaacatggcgatcaatat
gggtttaccatgaaggtgatcgttcctgttaccagcctgtgcccctgctccaagaaaata
tccgattacggtgcgcataaccagcgttcacacgtcaccatttctgtacacactaacagc
ttcgtctggattgaggacgttatcagaattgcggaagaacaggcctcatgcgaactgttc
ggtctgctgaaacggccggatgaaaaatatgtcacagaaaaggcctataacaatccgaaa
tttgtcgaagatatcgtccgtgatgtcgccgaaatacttaatcatgatgaccggatagat
gcctatgttgttgaatcagaaaactttgaatccatacataatcactctgcatacgcactg
atagagcgcgac
******************
input file3.txt
******************
FA tttgcc
FA ttcgcc
FA tttgct
FA ttcgct
LK ttaaaa
LK ttgaaa
LK ttaaag
LK ttgaag
LS ctgctc
LS ctgctt
LS ctactc
LS ctactt
LT ctcacc
LT ctcact
LT cttacc
LT cttact
LY ctctac
LY ctctat
LY ctttac
LY ctttat
LG ctcggc
LG ctcggt
LG cttggc
LG cttggt
IP attccc
IP attcct
IP atcccc
IP atccct
IP attcca
IP attccg
IP atccca
IP atcccg
ML atgctc
ML atgctt
ML atgctc
ML atgctt
VL gtgctg
VL gtgcta
VL gtactg
VL gtacta
VS gtgtcc
VS gtatct
VS gtgtcc
VS gtatct
VT gtcacc
VT gtcact
VT gttacc
VT gttact
VS gtcagc
VS gtcagt
VS gttagc
VS gttagt
SL tcgctg
SL tcgcta
SL tcactg
SL tcacta
SP tctcca
SP tctccg
SP tcccca
SP tccccg
PV ccggtg
PV ccggta
PV ccagtg
PV ccagta
PG cccggc
PG cccggt
PG cctggc
PG cctggt
TL acgctg
TL acgcta
TL acactg
TL acacta
TP acgccg
TP acgcca
TP acaccg
TP acacca
AL gcttta
AL gctttg
AL gcctta
AL gccttg
AP gcgccg
AP gcgcca
AP gcaccg
AP gcacca
AP gctcca
AP gctccg
AP gcccca
AP gccccg
AN gctaat
AN gctaac
AN gccaat
AN gccaac
AS gccagc
AS gccagt
AS gctagc
AS gctagt
YP tatccg
YP tatcca
YP tacccg
YP taccca
HP catccg
HP catcca
HP cacccg
HP caccca
QR cagcga
QR cagcgg
QR caacga
QR caacgg
DL gatttg
DL gattta
DL gacttg
DL gactta
EN gaaaat
EN gaaaac
EN gagaat
EN gagaac
EK gaaaaa
EK gaaaag
EK gagaaa
EK gagaag
ER gagcga
ER gagcgg
ER gaacga
ER gaacgg
WR tggcga
WR tggcgg
RV cgggtg
RV cgggta
RV cgagtg
RV cgagta
RW cggtgg
RW cgatgg
SG agtgga
SG agtggg
SG agcgga
SG agcggg
GF ggtttt
GF ggtttc
GF ggcttt
GF ggcttc
GL gggctg
GL gggcta
GL ggactg
GL ggacta
GY gggtat
GY gggtac
GY ggatat
GY ggatac
GY ggttat
GY ggttac
GY ggctat
GY ggctac
GK ggaaaa
GK ggaaag
GK gggaaa
GK gggaag
GK ggcaag
GK ggcaaa
GK ggtaag
GK ggtaaa
GW ggctgg
GW ggttgg
GR gggcgg
GR gggcga
GR ggacgg
GR ggacga
GS ggcagc
GS ggcagt
GS ggtagc
GS ggtagt
***********************
output fileout.txt
**********************
MKKHTDQPIADVQGSPDTRHIAIDRVGIKAIR(HP)VLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSM(VS)RLESDSGHIEMTFPYFVNKSAPISGVKSLLDYEVTFIGEIKHGDQYGFTMKVIVP(VT)SLCPCSKKISDYGAHNQRSH(VT)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(EK)YVT(EK)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(EF)ESIHNHSAYALIERD
MKKHTDQPIADVQGSPDTRHIAIDRVGIKAIR(dp)VLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSM(st)RLESDSGHIEMTFPYFVNKSAPISGVKSLLDYEVTFIGEIKHGDQYGFTMKVIVP(st)SLCPCSKKISDYGAHNQRSH(st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt)YVT(dt)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(dp)ESIHNHSAYALIERD
atgaaaaaacatactgatcaacctatcgctgatgtgcagggctcaccggataccagacatatcgcaattgacagagtcggaatcaaagcgattcgt(cacccg)gttctggtcgccgataaggatggtggttcccagcataccgtggcgcaatttaatatgtacgtcaatctgccacataatttcaaagggacgcatatgtcccgttttgtggagatactaaatagccacgaacgtgaaatttcggttgaatcatttgaagaaattttgcgctccatg(gtcagc)aggctggaatcagattccggccatattgaaatgacttttccctacttcgtcaataaatcagcccctatctcaggtgtaaaaagcttgctggattatgaggtaacctttatcggcgaaattaaacatggcgatcaatatgggtttaccatgaaggtgatcgttcct(gttacc)agcctgtgcccctgctccaagaaaatatccgattacggtgcgcataaccagcgttcacac(gtcacc)atttctgtacacactaacagcttcgtctggattgaggacgttatcagaattgcggaagaacaggcctcatgcgaactgttcggtctgctgaaacggccggat(gaaaaa)tatgtcaca(gaaaag)gcctataacaatccgaaatttgtcgaagatatcgtccgtgatgtcgccgaaatacttaatcatgatgaccggatagatgcctatgttgttgaatca(gaaaac)tttgaatccatacataatcactctgcatacgcactgatagagcgc
lengths 20 - 29 :
st)SLCPCSKKISDYGAHNQRSH(s|VTSLCPCSKKISDYGAHNQRSHV|sts
st)SLCPCSKKISDYGAHNQRSH(st|VTSLCPCSKKISDYGAHNQRSHVT|stst
t)SLCPCSKKISDYGAHNQRSH(s|TSLCPCSKKISDYGAHNQRSHV|ts
t)SLCPCSKKISDYGAHNQRSH(st|TSLCPCSKKISDYGAHNQRSHVT|tst
lengths 30 - 39 :
st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(d|VTISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDE|std
t)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(d|TISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDE|td
t)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt|TISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDEK|tdt
dt)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(d|EKAYNNPKFVEDIVRDVAEILNHDDRIDAYVVESE|dtd
dt)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(dp|EKAYNNPKFVEDIVRDVAEILNHDDRIDAYVVESEF|dtdp
t)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(d|KAYNNPKFVEDIVRDVAEILNHDDRIDAYVVESE|td
t)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(dp|KAYNNPKFVEDIVRDVAEILNHDDRIDAYVVESEF|tdp
lengths 40 - 49 :
st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt)YVT(d|VTISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDEKYVTE|stdtd
st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt)YVT(dt|VTISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDEKYVTEK|stdtdt
t)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt)YVT(d|TISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDEKYVTE|tdtd
t)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt)YVT(dt|TISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDEKYVTEK|tdtdt
dt)YVT(dt)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(d|EKYVTEKAYNNPKFVEDIVRDVAEILNHDDRIDAYVVESE|dtdtd
dt)YVT(dt)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(dp|EKYVTEKAYNNPKFVEDIVRDVAEILNHDDRIDAYVVESEF|dtdtdp
t)YVT(dt)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(d|KYVTEKAYNNPKFVEDIVRDVAEILNHDDRIDAYVVESE|tdtd
t)YVT(dt)AYNNPKFVEDIVRDVAEILNHDDRIDAYVVES(dp|KYVTEKAYNNPKFVEDIVRDVAEILNHDDRIDAYVVESEF|tdtdp
lengths 50 - 59 :
t)RLESDSGHIEMTFPYFVNKSAPISGVKSLLDYEVTFIGEIKHGDQYGFTMKVIVP(s|SRLESDSGHIEMTFPYFVNKSAPISGVKSLLDYEVTFIGEIKHGDQYGFTMKVIVPV|ts
lengths 60 - 69 :
dp)VLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSM(s|HPVLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSMV|dps
dp)VLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSM(st|HPVLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSMVS|dpst
p)VLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSM(s|PVLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSMV|ps
p)VLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSM(st|PVLVADKDGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSMVS|pst
st)RLESDSGHIEMTFPYFVNKSAPISGVKSLLDYEVTFIGEIKHGDQYGFTMKVIVP(st|VSRLESDSGHIEMTFPYFVNKSAPISGVKSLLDYEVTFIGEIKHGDQYGFTMKVIVPVT|stst
st)SLCPCSKKISDYGAHNQRSH(st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(d|VTSLCPCSKKISDYGAHNQRSHVTISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDE|ststd
st)SLCPCSKKISDYGAHNQRSH(st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt|VTSLCPCSKKISDYGAHNQRSHVTISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDEK|ststdt
t)SLCPCSKKISDYGAHNQRSH(st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(d|TSLCPCSKKISDYGAHNQRSHVTISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDE|tstd
t)SLCPCSKKISDYGAHNQRSH(st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt|TSLCPCSKKISDYGAHNQRSHVTISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDEK|tstdt
t)SLCPCSKKISDYGAHNQRSH(st)ISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPD(dt)YVT(d|TSLCPCSKKISDYGAHNQRSHVTISVHTNSFVWIEDVIRIAEEQASCELFGLLKRPDEKYVTE|tstdtd
Edited by: David Halitsky on Mar 18, 2008 4:21 AM
Edited by: David Halitsky on Mar 18, 2008 4:22 AM
As I've read through this, I am wondering what you are trying to evaluate. I am sure I can satisfy your requirements with just about any language on the planet --- it will be just as ugly as the C code in your example. Are you looking for the scripting language that makes it the most readable? The fastest (does it have to be interpreted, or can it be compiled)? The easiest to extend, etc.?
Language choice is all about using the right tool for the job --- I am missing what else you need it to do besides just work.
If this is simply a mental exercise in how many languages the algorithm can be built in... then it's not very difficult to answer --- any of them will do fine.
-d
Dan:
Sure, we can use any available language...And God knows there are plenty of them...I guess that David is looking for Speed and ease of development...
As you said...C code is somewhat ugly...And Ruby would be cleaner and easier...
Actually in my book "El Arte de Programar" I translated 5 algorithms into 14 Programming languages (Including ABAP, QBasic, Python, Perl, PHP and others)...Just to help people decide which language is easier to learn -;)
Greetings,
Blag.
Well, if LISP has gotten in here, then we might as well consider MONK.
MONK is/was an MIT knock-off of LISP that SeeBeyond (now Sun) used to use to build ETD's, before they were forced to go to JAVA.
So I guess you could say that MONK was SeeBeyond's ABAP.
heh heh heh
Although I am woefully inept at this class of languages, most tail-recursive functional languages would be best at solving this problem. This is really just a parsing problem like any other in the CS world. The most efficient languages in terms of expression for parsers are almost always functional languages. You can make Python look like a functional language, but that isn't the way it's speediest. I might try and bust out my Scheme book this weekend and have a crack at this.
Functional Languages of Note:
Scheme, Lisp, Erlang.
-d
Hi Dan -
Thanks very much for weighing in.
If you're going to try it in scheme, then as I said to Alvaro, please wait till tomorrow - I want to post a fuller version of Gunter's C code with three inputs and more outputs.
Also, I will probably have a really neat perl version posted as well - if Bill Mann has the time to write it tonight (like 3.5 minutes for him.)
Thanks again
djh
David:
I already know C and C++....So it's not that hard for me to understand the code...Craig already offered to make a PHP version....So if you want I can try to make a Ruby version -;) Ok...I want to do it anyway...So just tell me if you want me to send the code to you -:D
Greetings,
Blag.
Alvaro -
Yes - I would love to see a Ruby version.
But please wait till tomorrow - I want to send you a fuller version of the C program that takes three files as input and prints some additional kinds of output.
Then, if you do the PHP, we will have C, perl, Ruby, and PHP versions of the program, and we can probably have an interesting discussion of what's "better" to use when calling the program from SAP via RFC.
Thanks very much again.
Best
djh
The question should not be whether they can do it or not (they can); it's a matter of how efficient they would be or which would be faster.
Trying to wrap my head around how I would do it in Perl (been awhile). If you've got the code, I'll do a PHP version if you like, for comparison's sake - PHP, like Perl, can run browser based or command line based. I have a feeling Python might run quicker, but that's just a hunch.
Hi Craig -
Thanks for taking the time to reply.
Here's a C program that does a little more than the spec. It was written by a fellow named Gunter Sterten over on your side of the pond ... in Germany.
Following the program is a sample of input and output. (I've truncated the output because it's quite large.)
#include <stdio.h>
#include <stdlib.h>
int A[99999],G[333],B[99999],C[99999];
int x1,x2,n,m,a,b,c,i,j,k,x,y,z;
FILE *file;
int main(int argc,char*argv[]){
if(argc<2){printf("\nusage:20let file\n\n");
printf("marks pairs from different groups in file\n");
exit(1);}
x='s';G['I']=x;G['M']=x;G['V']=x;G['A']=x;G['G']=x;
x='p';G['F']=x;G['L']=x;G['P']=x;G['W']=x;G['W']=x;
x='d';G['H']=x;G['Q']=x;G['D']=x;G['E']=x;G['E']=x;
x='t';G['S']=x;G['T']=x;G['Y']=x;G['N']=x;G['C']=x;G['K']=x;G['R']=x;
/*
for(i=65;i<70;i++)G[i]='s';
for(i=70;i<75;i++)G[i]='p';
for(i=75;i<80;i++)G[i]='d';
for(i=80;i<85;i++)G[i]='t';
*/
if((file=fopen(argv[1],"rb"))==NULL){printf("\ncan't open file %s\n",argv[1]);exit(1);}
n=0;
m1:c=fgetc(file);if(c==EOF)goto m2; /* stop at EOF before using the value as an index into G */
n++;A[n]=c;if(G[A[n]]==0)n--;
goto m1;
m2:;
//for(i=1;i<=n;i++)printf("%c",A[i]);printf("\n");
//for(i=1;i<=n;i++)printf("%c",G[A[i]]);printf("\n");
m=0;for(i=1;i<=n;i++){
if(G[A[i]]==G[A[i+1]] || i==n){printf("%c",A[i]);m++;B[m]=A[i];goto m3;}
printf("(%c%c)",A[i],A[i+1]);i++;
m++;B[m]='(';m++;B[m]=A[i-1];m++;B[m]=A[i];m++;B[m]=')';
//printf("(%c)%c",G[A[i]],G[A[i+1]]);i++;
m3:;}printf("\n");
m=0;for(i=1;i<=n;i++){
//printf("i=%i A[i]=%i\n",i,A[i]);
if(G[A[i]]==G[A[i+1]] || i==n){printf("%c",A[i]);m++;C[m]=A[i];goto m4;}
printf("(%c%c)",G[A[i]],G[A[i+1]]);i++;
m++;C[m]='(';m++;C[m]=G[A[i-1]];m++;C[m]=G[A[i]];m++;C[m]=')';
//printf("(%c)%c",G[A[i]],G[A[i+1]]);i++;
m4:;}printf("\n");
//for(i=1;i<=m;i++)printf("%c",B[i]);printf("\n");
printf("\n");x1=20;x2=29;
for(i=1;i<m;i++)for(j=i+x1;j<i+x2;j++) {
if(C[i]>95 && C[j]>95){
for(x=i;x<=j;x++)printf("%c",C[x]);
printf("|");
for(x=i;x<=j;x++)if(B[x]>44)printf("%c",B[x]);
printf("\n");}
}
printf("\n");x1=30;x2=39;
for(i=1;i<m;i++)for(j=i+x1;j<i+x2;j++) {
if(C[i]>95 && C[j]>95){
for(x=i;x<=j;x++)printf("%c",C[x]);
printf("|");
for(x=i;x<=j;x++)if(B[x]>44)printf("%c",B[x]);
printf("\n");}
}
printf("\n");x1=40;x2=49;
for(i=1;i<m;i++)for(j=i+x1;j<i+x2;j++) {
if(C[i]>95 && C[j]>95){
for(x=i;x<=j;x++)printf("%c",C[x]);
printf("|");
for(x=i;x<=j;x++)if(B[x]>44)printf("%c",B[x]);
printf("\n");}
}
printf("\n");x1=50;x2=59;
for(i=1;i<m;i++)for(j=i+x1;j<i+x2;j++) {
if(C[i]>95 && C[j]>95){
for(x=i;x<=j;x++)printf("%c",C[x]);
printf("|");
for(x=i;x<=j;x++)if(B[x]>44)printf("%c",B[x]);
printf("\n");}
}
printf("\n");x1=60;x2=69;
for(i=1;i<m;i++)for(j=i+x1;j<i+x2;j++) {
if(C[i]>95 && C[j]>95){
for(x=i;x<=j;x++)printf("%c",C[x]);
printf("|");
for(x=i;x<=j;x++)if(B[x]>44)printf("%c",B[x]);
printf("\n");}
}
}
Input:
MNKQIDLPIADVQGSLDTRHIAIDRVGIKAIRHPVVVADKGGGSQHTVAQFNMYVNLPHNFKGTHMSRFVEILNSHEREISVESFEEILRSMVSRLESDSGHIEMAFPYFINKSAPVSGVKSLLDYEVTFIGEIKHGNQYSFTMKVIVPVTSLCPCSKKISDYGAHNQRSHVTISVRTNSFIWIEDIIRIAEEQASCELYGLLKRPDEKYVTERAYNNPKFVEDIVRDVAEVLNHDDRIDAYIVESENFESIHNHSAYALIERDKRIR
Outputs (in same output file):
(MN)(KQ)(ID)L(PI)(AD)(VQ)(GS)(LD)T(RH)IA(ID)(RV)G(IK)A(IR)(HP)VVV(AD)(KG)G(GS)Q(HT)V(AQ)(FN)(MY)(VN)L(PH)(NF)(KG)(TH)(MS)(RF)(VE)(IL)N(SH)(ER)(EI)(SV)(ES)(FE)(EI)(LR)(SM)(VS)(RL)(ES)(DS)(GH)(IE)M(AF)(PY)(FI)NK(SA)(PV)(SG)(VK)(SL)(LD)(YE)(VT)(FI)(GE)(IK)(HG)(NQ)Y(SF)(TM)(KV)I(VP)(VT)(SL)(CP)CSK(KI)(SD)(YG)(AH)(NQ)R(SH)(VT)(IS)(VR)TN(SF)(IW)(IE)(DI)(IR)I(AE)E(QA)S(CE)(LY)(GL)(LK)(RP)D(EK)(YV)(TE)(RA)YN(NP)(KF)(VE)(DI)(VR)(DV)(AE)(VL)(NH)D(DR)(ID)(AY)I(VE)(SE)(NF)(ES)(IH)(NH)(SA)(YA)(LI)(ER)(DK)(RI)R
(st)(td)(sd)L(ps)(sd)(sd)(st)(pd)T(td)IA(sd)(ts)G(st)A(st)(dp)VVV(sd)(ts)G(st)Q(dt)V(sd)(pt)(st)(st)L(pd)(tp)(ts)(td)(st)(tp)(sd)(sp)N(td)(dt)(ds)(ts)(dt)(pd)(ds)(pt)(ts)(st)(tp)(dt)(dt)(sd)(sd)M(sp)(pt)(ps)NK(ts)(ps)(ts)(st)(tp)(pd)(td)(st)(ps)(sd)(st)(ds)(td)Y(tp)(ts)(ts)I(sp)(st)(tp)(tp)CSK(ts)(td)(ts)(sd)(td)R(td)(st)(st)(st)TN(tp)(sp)(sd)(ds)(st)I(sd)E(ds)S(td)(pt)(sp)(pt)(tp)D(dt)(ts)(td)(ts)YN(tp)(tp)(sd)(ds)(st)(ds)(sd)(sp)(td)D(dt)(sd)(st)I(sd)(td)(tp)(dt)(sd)(td)(ts)(ts)(ps)(dt)(dt)(ts)R
st)(td)(sd)L(ps)(sd)(s|MNKQIDLPIADV
st)(td)(sd)L(ps)(sd)(sd|MNKQIDLPIADVQ
st)(td)(sd)L(ps)(sd)(sd)(s|MNKQIDLPIADVQG
st)(td)(sd)L(ps)(sd)(sd)(st|MNKQIDLPIADVQGS
t)(td)(sd)L(ps)(sd)(s|NKQIDLPIADV
t)(td)(sd)L(ps)(sd)(sd|NKQIDLPIADVQ
t)(td)(sd)L(ps)(sd)(sd)(s|NKQIDLPIADVQG
t)(td)(sd)L(ps)(sd)(sd)(st|NKQIDLPIADVQGS
t)(td)(sd)L(ps)(sd)(sd)(st)(p|NKQIDLPIADVQGSL
td)(sd)L(ps)(sd)(sd)(s|KQIDLPIADVQG
td)(sd)L(ps)(sd)(sd)(st|KQIDLPIADVQGS
td)(sd)L(ps)(sd)(sd)(st)(p|KQIDLPIADVQGSL
td)(sd)L(ps)(sd)(sd)(st)(pd|KQIDLPIADVQGSLD
d)(sd)L(ps)(sd)(sd)(s|QIDLPIADVQG
d)(sd)L(ps)(sd)(sd)(st|QIDLPIADVQGS
d)(sd)L(ps)(sd)(sd)(st)(p|QIDLPIADVQGSL
d)(sd)L(ps)(sd)(sd)(st)(pd|QIDLPIADVQGSLD
sd)L(ps)(sd)(sd)(st)(p|IDLPIADVQGSL
sd)L(ps)(sd)(sd)(st)(pd|IDLPIADVQGSLD
sd)L(ps)(sd)(sd)(st)(pd)T(t|IDLPIADVQGSLDTR
sd)L(ps)(sd)(sd)(st)(pd)T(td|IDLPIADVQGSLDTRH
Edited by: Craig Cmehil on Mar 15, 2008 10:41 AM (applied "code" format for better reading)
Craig -
I think I can get the perl for this very quickly ... so please hold off and keep an eye out for a post with the perl in a couple of days.
Please see also my note to Alvaro ... I want to send a fuller version of the program that will be better for teaching you guys about the basics of bioinformatics.
Thanks very much again.
djh