2nd Bioinformatic Coding Problem: genes into protein primary structures

former_member181923
Active Participant

On this wiki page:

https://wiki.sdn.sap.com/wiki/display/EmTech/Bio-InformaticBasicsInRelationtoScriptingLanguages

Post 3 (Protein Primary Structures from a Scripting Language Point of View) presents the background needed to understand how to translate a protein-coding gene like this:

atgaacaaacagatcgatctacccattgctgatgtacaaggctcgttggacacaagacat

attgccatcgacagagtaggaatcaaagcgatccggcatcctgtcgtggtggcagataaa

ggcggtggctcccagcataccgtggcgcaattcaatatgtacgtcaatctgccccacaac

ttcaagggaacccacatgtctcgctttgtcgagatactgaacagtcacgagcgcgagatt

tcggtcgaatcgttcgaggaaatcctgcgttccatggtcagcagactggaatcggattcc

ggacatatcgaaatggccttcccttacttcatcaataaatctgcacctgtctcgggtgta

aaaagcctgctggactacgaagtgacatttatcggtgagatcaaacacggcaatcaatat

agttttaccatgaaggtaatcgtccctgttaccagcctgtgcccctgctccaaaaaaata

tccgactacggtgcacacaaccagcgttcacatgtcacgatttcggtgcgtaccaatagt

ttcatctggatcgaggacatcatcagaatcgcggaagagcaggcctcatgcgaactgtac

ggcctgctgaaacgcccggatgaaaaatatgttacggaaagagcttacaacaatccgaaa

tttgtcgaagatatcgtccgcgatgtggccgaagtactcaaccacgatgaccgtatagac

gcctatatcgttgaatcagaaaatttcgaatccatacacaaccactctgcctacgcattg

atcgaacgagacaaaagaatacgataa

into a protein primary structure like this:

MNKQIDLPIADVQGSLDTRHIAIDRVGIKAIRHPVVVADKGGGSQHTVAQFNMYVNLPHNFKGTHMSRFV

EILNSHEREISVESFEEILRSMVSRLESDSGHIEMAFPYFINKSAPVSGVKSLLDYEVTFIGEIKHGNQY

SFTMKVIVPVTSLCPCSKKISDYGAHNQRSHVTISVRTNSFIWIEDIIRIAEEQASCELYGLLKRPDEKY

VTERAYNNPKFVEDIVRDVAEVLNHDDRIDAYIVESENFESIHNHSAYALIERDKRIR

using the "standard genetic code":

F: ttt S: tct Y: tat C: tgt

F: ttc S: tcc Y: tac C: tgc

L: tta S: tca *: taa *: tga

L: ttg S: tcg *: tag W: tgg

L: ctt P: cct H: cat R: cgt

L: ctc P: ccc H: cac R: cgc

L: cta P: cca Q: caa R: cga

L: ctg P: ccg Q: cag R: cgg

I: att T: act N: aat S: agt

I: atc T: acc N: aac S: agc

I: ata T: aca K: aaa R: aga

M: atg T: acg K: aag R: agg

V: gtt A: gct D: gat G: ggt

V: gtc A: gcc D: gac G: ggc

V: gta A: gca E: gaa G: gga

V: gtg A: gcg E: gag G: ggg

I'd love to have a copy of the necessary translation routine in each of the usual scripting languages - any routines posted in this thread will be added to the above wiki page.

Accepted Solutions (0)

Answers (1)


Former Member

$inseq  = "atgaacaaacagatcgatctacccattgctgatgtacaaggctcgttggacacaagacat";
$inseq .= "attgccatcgacagagtaggaatcaaagcgatccggcatcctgtcgtggtggcagataaa";
$inseq .= "ggcggtggctcccagcataccgtggcgcaattcaatatgtacgtcaatctgccccacaac";
$inseq .= "ttcaagggaacccacatgtctcgctttgtcgagatactgaacagtcacgagcgcgagatt";
$inseq .= "tcggtcgaatcgttcgaggaaatcctgcgttccatggtcagcagactggaatcggattcc";
$inseq .= "ggacatatcgaaatggccttcccttacttcatcaataaatctgcacctgtctcgggtgta";
$inseq .= "aaaagcctgctggactacgaagtgacatttatcggtgagatcaaacacggcaatcaatat";
$inseq .= "agttttaccatgaaggtaatcgtccctgttaccagcctgtgcccctgctccaaaaaaata";
$inseq .= "tccgactacggtgcacacaaccagcgttcacatgtcacgatttcggtgcgtaccaatagt";
$inseq .= "ttcatctggatcgaggacatcatcagaatcgcggaagagcaggcctcatgcgaactgtac";
$inseq .= "ggcctgctgaaacgcccggatgaaaaatatgttacggaaagagcttacaacaatccgaaa";
$inseq .= "tttgtcgaagatatcgtccgcgatgtggccgaagtactcaaccacgatgaccgtatagac";
$inseq .= "gcctatatcgttgaatcagaaaatttcgaatccatacacaaccactctgcctacgcattg";
$inseq .= "atcgaacgagacaaaagaatacgataa";

$trans  = array(
"ttt" => "F", "ctt" => "L", "att" => "I", "gtt" => "V",
"ttc" => "F", "ctc" => "L", "atc" => "I", "gtc" => "V",
"tct" => "S", "cta" => "L", "ata" => "I", "gta" => "V",
"tcc" => "S", "ctg" => "L", "atg" => "M", "gtg" => "V",
"tca" => "S", "cct" => "P", "act" => "T", "gct" => "A",
"tcg" => "S", "ccc" => "P", "acc" => "T", "gcc" => "A",
"tta" => "L", "cca" => "P", "aca" => "T", "gca" => "A",
"ttg" => "L", "ccg" => "P", "acg" => "T", "gcg" => "A",
"tat" => "Y", "cat" => "H", "aat" => "N", "gat" => "D",
"tac" => "Y", "cac" => "H", "aac" => "N", "gac" => "D",
"tgt" => "C", "caa" => "Q", "aaa" => "K", "gaa" => "E",
"tgc" => "C", "cag" => "Q", "aag" => "K", "gag" => "E",
"tgg" => "W", "cgt" => "R", "agt" => "S", "ggt" => "G",
"taa" => "*", "cgc" => "R", "agc" => "S", "ggc" => "G",
"tga" => "*", "cga" => "R", "aga" => "R", "gga" => "G",
"tag" => "*", "cgg" => "R", "agg" => "R", "ggg" => "G");

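// Accept RNA input as well: map u to t.
$inseq = strtr($inseq, "u", "t");
// For each stop codon, find its first in-frame occurrence; default to the
// sequence length when it never occurs on a codon boundary.
$stops = array("taa", "tga", "tag");
$substring_length = array();
foreach ($stops as $i => $stop) {
  $substring_length[$i] = strlen($inseq);
  $sl = strpos($inseq, $stop);
  // Skip matches that are not aligned to the reading frame.
  while ($sl !== false && $sl % 3 != 0) {
    $sl = strpos($inseq, $stop, $sl + 1);
  }
  if ($sl !== false) {
    $substring_length[$i] = $sl;
  }
}
// Translate only the part before the earliest in-frame stop codon.
echo strtr(substr($inseq, 0, min($substring_length)), $trans);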

most of the code (apart from the definition of the input parameters) is an attempt to efficiently find the first stop codon to avoid loading and translating a sequence of several kilobytes where actually the first stop codon appears after a few bytes; if this isn't necessary the actual algorithm is a one-liner.

the language is of course ... well, a little trivial riddle (google some keywords to find out).
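For comparison, the stop-codon search described above can also be written as an explicit codon-by-codon loop. This is only a sketch: the helper name translate_until_stop, the abbreviated demo table, and the choice to print X for a codon missing from the table are assumptions for illustration, not part of the posted routine.

```php
<?php
// Minimal sketch: translate codon by codon, stopping at the first in-frame stop codon.
// translate_until_stop() is a hypothetical helper, not part of the original post.
function translate_until_stop(string $seq, array $table): string
{
    // Accept upper-case and RNA input.
    $seq = strtr(strtolower($seq), "u", "t");
    $protein = "";
    // Step through the reading frame three bases at a time.
    for ($i = 0; $i + 3 <= strlen($seq); $i += 3) {
        $codon = substr($seq, $i, 3);
        // Assumption: a codon missing from the table is shown as X.
        $aa = $table[$codon] ?? "X";
        if ($aa === "*") {
            break; // the first in-frame stop codon ends translation
        }
        $protein .= $aa;
    }
    return $protein;
}

// Abbreviated table for the demo only; real input needs all 64 codons.
$demo_table = array("atg" => "M", "aac" => "N", "aaa" => "K", "cag" => "Q", "taa" => "*");

echo translate_until_stop("atgaacaaataacag", $demo_table), "\n"; // prints MNK
```

The loop reads only as much of the sequence as it translates, which is the same goal as the strpos-based search, at the cost of building the result one residue at a time.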

former_member181923
Active Participant

Hey Anton -

Thanks!

If it Googles like PHP, it must be PHP, huh?

Anyway, if you have a moment, could you add what's required to read the input string from a dat or txt file?

Thanks again - I'm hoping that others will follow with versions in other languages.

Also, if you have another moment, please give the "one-liner" version.

Best

djh

Former Member

Aaaaand ... PHP is correct. "Very useful answer"


$inseq = ... ;
$trans = ... ;
echo strtr(strtr($inseq, "u", "t"),$trans);

this version doesn't stop at the stop codons but translates the whole sequence showing the stops as *.
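If only the part before the first stop is wanted, the fully translated string can be cut at the first * afterwards. The sketch below assumes a tiny demo table and demo sequence (both hypothetical and chosen so that every codon in the sequence has an entry); with the full 64-codon table every triplet matches, so strtr stays in the reading frame and the first * is the first in-frame stop.

```php
<?php
// Demo table only: it covers just the codons in the demo sequence.
// The full 64-entry $trans table is needed for real input.
$demo_trans = array("atg" => "M", "aac" => "N", "aaa" => "K", "cag" => "Q", "taa" => "*");
$demo_seq = "atgaacaaataacag";

// Translate everything; stop codons show up as *.
$full = strtr(strtr($demo_seq, "u", "t"), $demo_trans);

// Keep only the part before the first *.
// explode() returns the whole string as element 0 if there is no * at all.
$first_orf = explode("*", $full)[0];

echo $full, "\n";      // prints MNK*Q
echo $first_orf, "\n"; // prints MNK
```

This still translates the whole input, so it does not save any work on long sequences; it only trims the output.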

adding the code to read in the input from files IMHO only distracts the reader from accessing the actual solution of the problem. everyone who will ever use this code will quickly find out how to read the input in from a file if he or she only devotes 15 minutes or so to the basics of the language.

anton.

former_member181923
Active Participant

AW -

Thanks for posting the "simplified" routine.

I see your point about the stream handling being a distraction.

I'll buy a PHP book, but I imagine it will follow standard conventions for talking to the op sys - similar to perl, etc.

Now - I wonder if the "slackers" are going to catch up to you! You're two problems ahead of them ...

Best

djh

Former Member

hi djh,

a new book is always a nice thing on one's shelf, but you don't necessarily need it to solve stream input in PHP (says a real bibliophile). just google it.

E.g. this nice resource: http://www.ibm.com/developerworks/library/os-php-readfiles/index.html. You'll quickly see that there is not only one but a number of alternative methods, from the more classical ones to the straightforward ones. this is one of the nice features of PHP. if you've got the passion to browse through the provided link you'll find


// explode() is used here because split() was removed in PHP 7
$array = explode("\n", file_get_contents("myfile"));

or, a little later, the parse_ini_file(...) function. so, reading the input file is really not an issue in PHP.

anton.
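As a rough illustration of the file-reading step, the sketch below reads a sequence from a plain-text file and joins it into one lower-case string ready for translation. The helper name read_sequence and the file layout (bases only, possibly wrapped over several lines) are assumptions; the demo writes its own temporary file so that it runs on its own.

```php
<?php
// read_sequence() is a hypothetical helper; the file layout
// (plain text, bases possibly split over several lines) is an assumption.
function read_sequence(string $path): string
{
    $raw = file_get_contents($path);
    if ($raw === false) {
        throw new RuntimeException("cannot read $path");
    }
    // Drop line breaks and other whitespace, then normalise the case.
    return strtolower(preg_replace('/\s+/', '', $raw));
}

// Demo: write a small two-line file to a temporary location and read it back.
$path = tempnam(sys_get_temp_dir(), "seq");
file_put_contents($path, "ATGAAC\naaacag\n");
$seq = read_sequence($path);
unlink($path);

echo $seq, "\n"; // prints atgaacaaacag
```

The returned string can then be passed to strtr together with the $trans table in the same way as the hard-coded $inseq.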

former_member181923
Active Participant

Hi Anton -

Thanks for the further guidance.

I will be sure to post all your comments in the wiki, not just the code snippets, as soon as I'm off work tomorrow.

Again, I am eager to see what you (and/or others) do with the first "interesting" problem that I will pose this weekend involving protein secondary structure.

In addition to some regex matters, this problem will also involve something we've talked about before ... submitting a query to a foreign URL from within a WDA application and parsing the html/xml that's returned.

Best regards

djh