cancel
Showing results for 
Search instead for 
Did you mean: 

New bioinformatic coding problems (4a-b) added to WIKI

Former Member
0 Kudos

I've stated Coding Problems 4a-b at the end of this WIKI page:

https://wiki.sdn.sap.com/wiki/display/EmTech/Bio-InformaticBasicsInRelationtoScriptingLanguages

Accepted Solutions (0)

Answers (2)

Answers (2)

Former Member
0 Kudos

to be honest, I didn't understand the need for 4b. so I played around with something different, which is quiet helpful and illustrative if someone wants to seriously code stuff like this.

the PDB files can be found at an FTP server of a not for profit or governmental site. there are some 300.000+ files with new and updated files regularly added. so, it would be nice to find a way to autmatically download them if needed and kind of cache them locally.

moreover the files are compressed, which is a good idea because even in compressed form the whole data set makes up 70 something GB, which tsakes some time to download and might get you into trouble with your internet provider. so, it would be nice to be able to work with zipped files on the fly.

following is the solution. a little function which checks locally if the file has already been downloaded earlier. if yes, it uses this file if no, it downloads the file in question, saves it locally, unzips it and searches for the sequence as usual.


function loadPDB($inFile){
    
  $ftp_server    = "ftp.wwpdb.org";
  $ftp_user_name = "anoynmous";
  $ftp_user_pass = "adsad@adsad";
  $pdb_path      = "C:\some_path\\ents\\";
  $localFile     =  $pdb_path . $inFile;
  
  if(!file_exists($localFile)) {
    $cid           = ftp_connect($ftp_server);
    $login_result  = ftp_login($cid, $ftp_user_name, $ftp_user_pass);
    ftp_chdir($cid, "/pub/pdb/data/structures/all/pdb");

    if (!ftp_get($cid, $localFile, $inFile, FTP_BINARY)) {
      ftp_close($cid);
      return null; // a problem occured
    }
    ftp_close($cid);
  }  
  $ifp   = gzopen($localFile, "r");
  $inarr = split("\n", gzread($ifp, 100000000));
  return $inarr;
}

the earlier script has to be adopted like follows


<?
function loadPDB($inFile){
 ...
}

// ------- sequence to find
$seq2find = "IDKIKIAADVIRNGGTVAF";

// ------- loading PDB file
$inlines = loadPDB("pdb1kna.ent.gz");

// ------- loading amino acid map
....

This last snippet would search the sequence IDKIKIAADVIRNGGTVAF in pdb1kna.ent.gz no matter if it were already downloaded or still to be downloaded from the ftp site.

have fun.

former_member181923
Active Participant
0 Kudos

You rule again, Anton!

In fact, I think you rule in perpetuity.

Anyway, regarding problem 4b - you're right - it's really just a convenience program.

But keep an eye out in the WIKI for problem 5 - it's a "fuzzy" logic problem - I'll try to write it up tonight.

Also, I agree with everything you said about working from downloaded files, but there is more to be said about this topic ... when we finally get to the point where we can talk about WDA calling PHP routines to query STRIDE or PDB and parsing the returned pages.

Best regards

djh

Former Member
0 Kudos

here's a solution to 4a. to keep the coding compact, we put the 'amino acid map' into an external file and read it in at runtime. The layout is expected to be like:


D:ASP:aspartic acid
E:GLU:glutamic acid
F:PHE:phenylalanine
...

The following script should still be self explaining and therefore is only minimally commented. It features file input, some nice PHP functions like array_flip() and a regular expression to identify and parse the atom lines.


<?
// ------- sequence to find
$seq2find = "IDKIKIAADVIRNGGTVAF";

// ------- loading PDB file
$inlines = split("\n", file_get_contents("C:\some_path\\pdbs\\pdb2eqa.ent"));

// ------- loading amino acid map
$acids_in  = split("\n", file_get_contents("C:\some_path\\cfgs\\amino_acids.cfg"));
foreach($acids_in as $acf) {
  $acid_tmp = explode(":", $acf);
  $acid_keys[$acid_tmp[0]] = $acid_tmp[1];
}

// ------- flip amino acid map
$acid_keys_rev[] = array_flip($acid_keys);

$seq = ""; $old = 0; $spos[0] = '---';

// ------- the pattern of target lines in the PDB files
$atom_pattern = "/^ATOM\s*(\d*)\s*([A-Z0-9]*)\s*([A-Z]{3})\s*[A-Z]*\s*(\d*).*/i"; 

// ------- create a primary structure string for the PDB file
foreach($inlines as $inline) {
  preg_match($atom_pattern, $inline, $matches);  
  if (isset($matches[3])) {
    if($old <> $matches[4] ) {
      $seq   .= $acid_keys_rev[0][$matches[3]]; 
      $spos[] = $matches[1];
    }
    $old = $matches[4];
  }
}

// ------- find sequence to find in primary structure
$startline = $spos[strpos($seq, $seq2find)+1];
$endline   = $spos[strpos($seq, $seq2find)+ strlen($seq2find)+1];

// ------- collect PDB's atom lines according to the sequence to find
if (isset($startline) && $startline <> "---") {
  foreach($inlines as $inline) {
    preg_match($atom_pattern, $inline, $matches);  
    if (isset($matches[3])) {
      if ($matches[1]>=$startline && $matches[1]<=$endline-1) {
        echo $inline . "\n";
      }
    }
  }
}
else {
  echo "Sequence not found!";
}
?>

If someone out there knows something about the intermittent HETATM lines, educate me

anton

former_member181923
Active Participant
0 Kudos

You rule, Anton!

Regarding "HETATM" lines:

proteins can contain foreign objects other than amino acids, e.g. metal ions are typically used to aid catalysis.

the hetatm lines are for the atoms of those kinds of objects

remember, when they do x-ray crystallography on a protein, the get a position (3d coordinates for every atom of every object in the structure they've crystallized - not just the atoms of the amino acids of the protein (the atom lines.)

Glad you asked the question - I knew it would come up.

Best

djh