on 06-08-2008 6:58 PM
On this wiki page:
https://wiki.sdn.sap.com/wiki/display/EmTech/Bio-InformaticBasicsInRelationtoScriptingLanguages
Post 3 (Protein Primary Structures from a Scripting Language Point of View) presents the background needed to understand how to translate a protein gene like this:
atgaacaaacagatcgatctacccattgctgatgtacaaggctcgttggacacaagacat
attgccatcgacagagtaggaatcaaagcgatccggcatcctgtcgtggtggcagataaa
ggcggtggctcccagcataccgtggcgcaattcaatatgtacgtcaatctgccccacaac
ttcaagggaacccacatgtctcgctttgtcgagatactgaacagtcacgagcgcgagatt
tcggtcgaatcgttcgaggaaatcctgcgttccatggtcagcagactggaatcggattcc
ggacatatcgaaatggccttcccttacttcatcaataaatctgcacctgtctcgggtgta
aaaagcctgctggactacgaagtgacatttatcggtgagatcaaacacggcaatcaatat
agttttaccatgaaggtaatcgtccctgttaccagcctgtgcccctgctccaaaaaaata
tccgactacggtgcacacaaccagcgttcacatgtcacgatttcggtgcgtaccaatagt
ttcatctggatcgaggacatcatcagaatcgcggaagagcaggcctcatgcgaactgtac
ggcctgctgaaacgcccggatgaaaaatatgttacggaaagagcttacaacaatccgaaa
tttgtcgaagatatcgtccgcgatgtggccgaagtactcaaccacgatgaccgtatagac
gcctatatcgttgaatcagaaaatttcgaatccatacacaaccactctgcctacgcattg
atcgaacgagacaaaagaatacgataa
into a protein primary structure like this:
MNKQIDLPIADVQGSLDTRHIAIDRVGIKAIRHPVVVADKGGGSQHTVAQFNMYVNLPHNFKGTHMSRFV
EILNSHEREISVESFEEILRSMVSRLESDSGHIEMAFPYFINKSAPVSGVKSLLDYEVTFIGEIKHGNQY
SFTMKVIVPVTSLCPCSKKISDYGAHNQRSHVTISVRTNSFIWIEDIIRIAEEQASCELYGLLKRPDEKY
VTERAYNNPKFVEDIVRDVAEVLNHDDRIDAYIVESENFESIHNHSAYALIERDKRIR
using the "standard genetic code":
F: ttt S: tct Y: tat C: tgt
F: ttc S: tcc Y: tac C: tgc
L: tta S: tca *: taa *: tga
L: ttg S: tcg *: tag W: tgg
L: ctt P: cct H: cat R: cgt
L: ctc P: ccc H: cac R: cgc
L: cta P: cca Q: caa R: cga
L: ctg P: ccg Q: cag R: cgg
I: att T: act N: aat S: agt
I: atc T: acc N: aac S: agc
I: ata T: aca K: aaa R: aga
M: atg T: acg K: aag R: agg
V: gtt A: gct D: gat G: ggt
V: gtc A: gcc D: gac G: ggc
V: gta A: gca E: gaa G: gga
V: gtg A: gcg E: gag G: ggg
I'd love to have a copy of the necessary translation routine in each of the usual scripting languages - any routines posted in this thread will be added to the above wiki page.
$inseq = "atgaacaaacagatcgatctacccattgctgatgtacaaggctcgttggacacaagacat";
$inseq .= "attgccatcgacagagtaggaatcaaagcgatccggcatcctgtcgtggtggcagataaa";
$inseq .= "ggcggtggctcccagcataccgtggcgcaattcaatatgtacgtcaatctgccccacaac";
$inseq .= "ttcaagggaacccacatgtctcgctttgtcgagatactgaacagtcacgagcgcgagatt";
$inseq .= "tcggtcgaatcgttcgaggaaatcctgcgttccatggtcagcagactggaatcggattcc";
$inseq .= "ggacatatcgaaatggccttcccttacttcatcaataaatctgcacctgtctcgggtgta";
$inseq .= "aaaagcctgctggactacgaagtgacatttatcggtgagatcaaacacggcaatcaatat";
$inseq .= "agttttaccatgaaggtaatcgtccctgttaccagcctgtgcccctgctccaaaaaaata";
$inseq .= "tccgactacggtgcacacaaccagcgttcacatgtcacgatttcggtgcgtaccaatagt";
$inseq .= "ttcatctggatcgaggacatcatcagaatcgcggaagagcaggcctcatgcgaactgtac";
$inseq .= "ggcctgctgaaacgcccggatgaaaaatatgttacggaaagagcttacaacaatccgaaa";
$inseq .= "tttgtcgaagatatcgtccgcgatgtggccgaagtactcaaccacgatgaccgtatagac";
$inseq .= "gcctatatcgttgaatcagaaaatttcgaatccatacacaaccactctgcctacgcattg";
$inseq .= "atcgaacgagacaaaagaatacgataa";
$trans = array(
"ttt" => "F", "ctt" => "L", "att" => "I", "gtt" => "V",
"ttc" => "F", "ctc" => "L", "atc" => "I", "gtc" => "V",
"tct" => "S", "cta" => "L", "ata" => "I", "gta" => "V",
"tcc" => "S", "ctg" => "L", "atg" => "M", "gtg" => "V",
"tca" => "S", "cct" => "P", "act" => "T", "gct" => "A",
"tcg" => "S", "ccc" => "P", "acc" => "T", "gcc" => "A",
"tta" => "L", "cca" => "P", "aca" => "T", "gca" => "A",
"ttg" => "L", "ccg" => "P", "acg" => "T", "gcg" => "A",
"tat" => "Y", "cat" => "H", "aat" => "N", "gat" => "D",
"tac" => "Y", "cac" => "H", "aac" => "N", "gac" => "D",
"tgt" => "C", "caa" => "Q", "aaa" => "K", "gaa" => "E",
"tgc" => "C", "cag" => "Q", "aag" => "K", "gag" => "E",
"tgg" => "W", "cgt" => "R", "agt" => "S", "ggt" => "G",
"taa" => "*", "cgc" => "R", "agc" => "S", "ggc" => "G",
"tga" => "*", "cga" => "R", "aga" => "R", "gga" => "G",
"tag" => "*", "cgg" => "R", "agg" => "R", "ggg" => "G");
// Accept RNA input as well: map "u" to "t" before translating.
$inseq = strtr($inseq, "u", "t");
$stops = array("taa", "tga", "tag");
// Default: no in-frame stop codon found, so translate the whole sequence.
$cut = strlen($inseq);
foreach ($stops as $stop) {
    // Step through the occurrences of this stop codon until one is in frame.
    $pos = strpos($inseq, $stop);
    while ($pos !== false && $pos % 3 != 0) {
        $pos = strpos($inseq, $stop, $pos + 1);
    }
    if ($pos !== false) {
        $cut = min($cut, $pos);
    }
}
// All 64 codons are keys of $trans, so strtr() consumes the string codon by codon.
echo strtr(substr($inseq, 0, $cut), $trans);
Most of the code (apart from the definition of the input parameters) is an attempt to find the first stop codon efficiently, to avoid loading and translating a sequence of several kilobytes when the first stop codon actually appears after a few bytes. If this isn't necessary, the actual algorithm is a one-liner.
The language is of course ... well, a little trivial riddle (google some keywords to find out).
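For comparison, here is a minimal sketch of the same idea done codon by codon instead of searching for stop codons with strpos(). It is not the routine posted above, just one possible alternative in PHP. The helper names build_codon_table() and translate_until_stop() are made up for this illustration, and the table is built from the conventional t, c, a, g ordering of the standard genetic code rather than typed out by hand.

```php
<?php
// Hypothetical helper: builds the 64-entry standard genetic code table
// from the conventional t, c, a, g ordering instead of listing every codon.
function build_codon_table() {
    $bases = "tcag";
    $amino = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";
    $table = array();
    for ($i = 0; $i < 4; $i++) {
        for ($j = 0; $j < 4; $j++) {
            for ($k = 0; $k < 4; $k++) {
                $codon = $bases[$i] . $bases[$j] . $bases[$k];
                $table[$codon] = $amino[16 * $i + 4 * $j + $k];
            }
        }
    }
    return $table;
}

// Hypothetical helper: translates in frame from position 0 and stops at
// the first stop codon, which is not included in the result.
function translate_until_stop($seq, $table) {
    $seq = strtr(strtolower($seq), "u", "t");
    $protein = "";
    foreach (str_split($seq, 3) as $codon) {
        if (!isset($table[$codon]) || $table[$codon] === "*") {
            break; // stop codon, incomplete trailing codon, or unknown base
        }
        $protein .= $table[$codon];
    }
    return $protein;
}

$table = build_codon_table();
echo translate_until_stop("atgaacaaacagatcgattaaggg", $table), "\n"; // MNKQID
```

The loop gives up silently on the first codon it cannot look up, which is a design choice made for this sketch; a stricter version might report the offending position instead.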
Hey Anton -
Thanks!
If it Googles like PHP, it must be PHP, huh?
Anyway, if you have a moment, could you add what's required to read the input string from a dat or txt file?
Thanks again - I'm hoping that others will follow with versions in other languages.
Also, if you have another moment, please give the "one-liner" version.
Best
djh
Aaaaand ... PHP is correct. "Very useful answer"
$inseq = ... ;
$trans = ... ;
echo strtr(strtr($inseq, "u", "t"),$trans);
This version doesn't stop at the stop codons but translates the whole sequence, showing the stops as *.
Adding the code to read the input from files IMHO only distracts the reader from the actual solution of the problem. Anyone who ever uses this code will quickly find out how to read the input from a file after devoting 15 minutes or so to the basics of the language.
anton.
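A side note on the one-liner above: strtr() with an array only stays in the reading frame when every three-letter chunk it meets is a key of the table. A character outside a, c, g, t is copied through unchanged, which can shift the frame for everything after it. The sketch below is one possible guard, assuming lowercase input that has already had "u" mapped to "t". The name is_translatable() and the three-entry table are invented for the illustration.

```php
<?php
// Hypothetical helper: true only if $seq can be translated codon by codon,
// i.e. it contains nothing but a, c, g, t and its length is a multiple of 3.
function is_translatable($seq) {
    return strlen($seq) % 3 === 0 && preg_match('/\A[acgt]*\z/', $seq) === 1;
}

// Partial table for illustration only; a real run needs all 64 codons.
$trans = array("atg" => "M", "aaa" => "K", "taa" => "*");

$clean = "atgaaataa";
$dirty = "atgnaaataa"; // contains one ambiguous base "n"

var_dump(is_translatable($clean)); // bool(true)
var_dump(is_translatable($dirty)); // bool(false)

// On clean input strtr() walks the string three letters at a time.
echo strtr($clean, $trans), "\n"; // MK*
```

Whether to reject such input or to clean it up first is a judgment call; the check above only flags it.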
AW -
Thanks for posting the "simplified" routine.
I see your point about the stream handling being a distraction.
I'll buy a PHP book, but I imagine it will follow standard conventions for talking to the op sys - similar to Perl, etc.
Now - I wonder if the "slackers" are going to catch up to you! You're two problems ahead of them ...
Best
djh
hi djh,
A new book is always a nice thing on one's shelf, but you don't necessarily need it to solve stream input in PHP (says a real bibliophile). Just google it.
E.g. this [nice resource|http://www.ibm.com/developerworks/library/os-php-readfiles/index.html]. You'll quickly see that there is not just one but a number of alternative methods, from the more classical ones to the straightforward ones. This is one of the nice features of PHP. If you've got the passion to browse through the provided link you'll find
$array = explode("\n", file_get_contents("myfile"));
or, a little later, the parse_ini_file(...) function. So reading the input file is really not an issue in PHP.
anton.
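To make the file-reading suggestion concrete for the sequence in this thread, here is a minimal sketch. It assumes a plain text file that holds nothing but the sequence, possibly wrapped over several lines, with no header lines. The function name read_sequence() is invented for this example; file_get_contents(), preg_replace() and strtolower() are standard PHP functions.

```php
<?php
// Hypothetical helper: reads a sequence file and returns one lowercase
// string with all whitespace (line breaks, spaces) removed.
function read_sequence($path) {
    $raw = file_get_contents($path);
    if ($raw === false) {
        return false; // file missing or unreadable
    }
    return strtolower(preg_replace('/\s+/', '', $raw));
}

// Demo: write a small two-line file to the temp directory, then read it back.
$path = tempnam(sys_get_temp_dir(), "seq");
file_put_contents($path, "ATGAAC\naaacag\n");
echo read_sequence($path), "\n"; // atgaacaaacag
unlink($path);
```

The returned string can then be assigned to $inseq in place of the hard-coded sequence.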
Hi Anton -
Thanks for the further guidance.
I will be sure to post all your comments in the wiki, not just the code snippets, as soon as I'm off work tomorrow.
Again, I am eager to see what you (and/or others) do with the first "interesting" problem that I will pose this weekend involving protein secondary structure.
In addition to some regex matters, this problem will also involve something we've talked about before ... submitting a query to a foreign URL from within a WDA application and parsing the html/xml that's returned.
Best regards
djh