New bioinformatic coding problems (4a-b) added to ...

Former Member · ‎06-21-2008

I've stated Coding Problems 4a-b at the end of this WIKI page:

https://wiki.sdn.sap.com/wiki/display/EmTech/Bio-InformaticBasicsInRelationtoScriptingLanguages

Former Member · ‎06-24-2008

to be honest, I didn't understand the need for 4b. so I played around with something different, which is quiet helpful and illustrative if someone wants to seriously code stuff like this.

the PDB files can be found at an FTP server of a not for profit or governmental site. there are some 300.000+ files with new and updated files regularly added. so, it would be nice to find a way to autmatically download them if needed and kind of cache them locally.

moreover the files are compressed, which is a good idea because even in compressed form the whole data set makes up 70 something GB, which tsakes some time to download and might get you into trouble with your internet provider. so, it would be nice to be able to work with zipped files on the fly.

following is the solution. a little function which checks locally if the file has already been downloaded earlier. if yes, it uses this file if no, it downloads the file in question, saves it locally, unzips it and searches for the sequence as usual.


function loadPDB($inFile){
    
  $ftp_server    = "ftp.wwpdb.org";
  $ftp_user_name = "anoynmous";
  $ftp_user_pass = "adsad@adsad";
  $pdb_path      = "C:\some_path\\ents\\";
  $localFile     =  $pdb_path . $inFile;
  
  if(!file_exists($localFile)) {
    $cid           = ftp_connect($ftp_server);
    $login_result  = ftp_login($cid, $ftp_user_name, $ftp_user_pass);
    ftp_chdir($cid, "/pub/pdb/data/structures/all/pdb");

    if (!ftp_get($cid, $localFile, $inFile, FTP_BINARY)) {
      ftp_close($cid);
      return null; // a problem occured
    }
    ftp_close($cid);
  }  
  $ifp   = gzopen($localFile, "r");
  $inarr = split("\n", gzread($ifp, 100000000));
  return $inarr;
}

the earlier script has to be adopted like follows


<?
function loadPDB($inFile){
 ...
}

// ------- sequence to find
$seq2find = "IDKIKIAADVIRNGGTVAF";

// ------- loading PDB file
$inlines = loadPDB("pdb1kna.ent.gz");

// ------- loading amino acid map
....

This last snippet would search the sequence IDKIKIAADVIRNGGTVAF in pdb1kna.ent.gz no matter if it were already downloaded or still to be downloaded from the ftp site.

have fun.

Former Member · ‎06-24-2008

here's a solution to 4a. to keep the coding compact, we put the 'amino acid map' into an external file and read it in at runtime. The layout is expected to be like:


D:ASP:aspartic acid
E:GLU:glutamic acid
F:PHE:phenylalanine
...

The following script should still be self explaining and therefore is only minimally commented. It features file input, some nice PHP functions like array_flip() and a regular expression to identify and parse the atom lines.


<?
// ------- sequence to find
$seq2find = "IDKIKIAADVIRNGGTVAF";

// ------- loading PDB file
$inlines = split("\n", file_get_contents("C:\some_path\\pdbs\\pdb2eqa.ent"));

// ------- loading amino acid map
$acids_in  = split("\n", file_get_contents("C:\some_path\\cfgs\\amino_acids.cfg"));
foreach($acids_in as $acf) {
  $acid_tmp = explode(":", $acf);
  $acid_keys[$acid_tmp[0]] = $acid_tmp[1];
}

// ------- flip amino acid map
$acid_keys_rev[] = array_flip($acid_keys);

$seq = ""; $old = 0; $spos[0] = '---';

// ------- the pattern of target lines in the PDB files
$atom_pattern = "/^ATOM\s*(\d*)\s*([A-Z0-9]*)\s*([A-Z]{3})\s*[A-Z]*\s*(\d*).*/i"; 

// ------- create a primary structure string for the PDB file
foreach($inlines as $inline) {
  preg_match($atom_pattern, $inline, $matches);  
  if (isset($matches[3])) {
    if($old <> $matches[4] ) {
      $seq   .= $acid_keys_rev[0][$matches[3]]; 
      $spos[] = $matches[1];
    }
    $old = $matches[4];
  }
}

// ------- find sequence to find in primary structure
$startline = $spos[strpos($seq, $seq2find)+1];
$endline   = $spos[strpos($seq, $seq2find)+ strlen($seq2find)+1];

// ------- collect PDB's atom lines according to the sequence to find
if (isset($startline) && $startline <> "---") {
  foreach($inlines as $inline) {
    preg_match($atom_pattern, $inline, $matches);  
    if (isset($matches[3])) {
      if ($matches[1]>=$startline && $matches[1]<=$endline-1) {
        echo $inline . "\n";
      }
    }
  }
}
else {
  echo "Sequence not found!";
}
?>

If someone out there knows something about the intermittent HETATM lines, educate me

anton

New bioinformatic coding problems (4a-b) added to WIKI

Accepted Solutions (0)

Answers (2)

Answers (2)

Re: ASEGURAMIENTO DE TABLAS

Re: Getting error while Creating role in BRM

Re: How to enable the add, delete,edit system on s...

Re: CJS-30252 Running priviledged function stop of...

Successfactors login issue.