Split the sequences having significant BLAST hits

To perform sequences similarity search, we usually use NCBI BLAST. However, we often need to parse the result to split the sequences which have significant and/or insignificant hits, respectively. Here is the sample to do this work via bioperl, I hope it will supply useful information to you. Thank you!

#!/usr/bin/perl -w

use strict;
use Bio::SearchIO;

my $query = "";
my $SeqId = undef;
my %hit = ();
my %Seq = ();
my %Des = ();

open (PNI, "seq.fasta");
while (my $line =
) {
if ($line =~ /^>(\S+)\s*.*$/) {
$SeqId = $1;
$Des{$SeqId} = $line;
} else {
$Seq{$SeqId} .= $line;
}
}
close PNI;

open (NO, ">nohits.fasta");
open (HIT, ">hits.fasta");

my $in = new Bio::SearchIO(
-format => 'blast',
-file => 'blast.txt');

while(my $result = $in->next_result) {
$query = $result->query_name;
while (my $hit = $result->next_hit) {
$hit{$query} = $hit->name;
}
if (!defined $hit{$query}) {
#$hit{$query} = "No hits found";
print NO $Des{$query};
print NO $Seq{$query};
} else {
print HIT $Des{$query};
print HIT $Seq{$query};
}
}

close NO;
close HIT;

Posted in Bioinformatics | Tagged , , , , | Leave a comment

Build User Query System for DNA/Protein Sequence by PHP and MySQL

In fact, I am a beginner of PHP and MySQL, so this widget may have bugs and network security problem. However, this is my first try of PHP and MySQL. Although some one have developed much more powerful model, I’d like to share my naïve user query system, and I call it “naive”.

To built a database for DNA/protein sequences querying with user friendly inferface, firstly we need to reformat our sequences of fasta/genbank format into tab-delimited file, this should be easy work we can deal with Perl or something else, here is my demo file.

http://icethink.com/widget/naive/seqbase.dat

And secondly, create database and table by MySQL, the following are the demo SQL language.
CREATE TABLE `seqbase` (
`auto_seq` int(10) unsigned NOT NULL AUTO_INCREMENT,
`seqb_id` varchar(40) NOT NULL DEFAULT '',
`organism` varchar(100) DEFAULT NULL,
`sequence` blob,
`comment` longtext,
PRIMARY KEY (`auto_seq`),
UNIQUE KEY `seqb_id` (`seqb_id`)
) ENGINE=MyISAM AUTO_INCREMENT=1009 DEFAULT CHARSET=latin1;

Thirdly, load sequence file into MySQL, use the following command:
LOAD DATA LOCAL INFILE "seqbase.dat" INTO TABLE seqbase;
(Suppose you sequence file is stored as “seqbase.dat” in current directory, usually is your home directory)
Now the database have be created, the follow thing is to develop an user interface for querying, the demo pages of query and result could be obtained from the source page as follow:
http://icethink.com/widget/naive/
The following image are the screen shot of query and result pages, please enjoy discuss to how improve it.

Posted in Bioinformatics | Tagged , , , , , , , , | Leave a comment

Let’s make a deal by Perl

Let’s make a deal is a fashion TV game; here I designed a Perl script to simulate this game and print the result of the game. In fact, this is some basic about Bayesian theory.

#!/usr/bin/perl -w

#Let's make a deal

use strict;

my $award = undef;
my $host = undef;
my $guest_1st = undef;
my $guest_2nd = undef;
my $count = 0;
my $strategy = $ARGV[0] || "1"; #strategy belongs to {1,2} = {swtich,not switch}
my %strategy = ("1" => "swtich",
    "2" => "not switch");
my $time = $ARGV[1] || "10000";

for (my $i = 0; $i < $time; $i++) {
 #Set award
 $award = int(3 * rand);
 #Guest guess
 $guest_1st = int(3 * rand);
 #Host opens the box (non-award)
 ($host) = judge($award, $guest_1st);
 #Guest guess again
 $guest_2nd = guess($guest_1st, $host, $strategy);
 #counts
 if ($guest_2nd == $award) {
  $count++;
 }
}

my $right = $count / $time;

print "By ", $strategy{$strategy}, "ing, you won ", 100 * $right, "\%\.\n";

sub guess{
 my ($guest_1st, $host, $strategy) = @_;
 if ($strategy == 1) {
  $guest_2nd = 3 - $guest_1st - $host;
 } elsif ($strategy == 2) {
  $guest_2nd = $guest_1st;
 } else {
  print "Are you kid me?\n";
 }
 return $guest_2nd;
}

sub judge {
 my ($award, $guest_1st) = @_;
 my $host = undef;
 if ($award == $guest_1st) {
  $host = 3 - $award - dice($award);
 }
 if ($award != $guest_1st) {
  $host = 3 - $award - $guest_1st;
 }
 return $host;
}

sub dice {
 my ($award) = @_;
 my $dice = rand;
 if ($award == 1) {
  if ($dice <= 0.5) {
   $dice = 0;
  } else {
   $dice = 2;
  }
 } elsif ($award == 2) {
  if ($dice <= 0.5) {
   $dice = 0;
  } else {
   $dice = 1;
  }
 } else {
  if ($dice <= 0.5) {
   $dice = 1;
  } else {
   $dice = 2;
  }
 }
 return $dice;
}

Posted in Uncategorized | Tagged , , , | 8 Comments

Use hash to remove redundance of a table

In some of our daily bioinformatics works, we need to remove the redundancy of a table. Here we use Perl to do this. Remove redundancy by some fields are the same things, so I just show you how to perform this by Perl, of course you can do this by the system command “sort” (in UNIX/Linux). And the basic principle is that the key of each hash element is unique.

#!/usr/bin/perl -w

use strict;

#Create/Open a table with redundant.
open (TAB, ">data.table") or die "Cannot open file data.table: $!\n";

print TAB "1 2 3 4\n";
print TAB "1 2 3 4\n";
print TAB "2 3 4 5\n";
print TAB "1 2 1 4\n";
print TAB "2 3 4 5\n";
print TAB "1 2 3 4\n";

close TAB;

#Remove Redundant Records.
open (TAB, "data.table") or die "Cannot open file data.table: $!\n";
open (NRT, ">table.nr") or die "Cannot open file table.nr: $!\n";

my @table = ();
my $i = 0;
while (my $line = <TAB>) {
#Load Table.
$table[$i] = $line;
$i++;
}
my @table_nr = ();
my %control = ();
foreach (@table) {
#Use Hash to Remove Redundance.
if (!$control{$_}) {
push (@table_nr, $_);
}
$control{$_}++;
}

#Output the non-redundant table.
print NRT "@table_nr\n";

close NRT;
close TAB;

Posted in Bioinformatics | Tagged , , , , , , | Leave a comment

[PLoS Comput Biol] Understand bioinformatics in episteme and techne

In the recently publication on PLoS Comput Biol, a perspective paper explored the root of bioinformatics, mainly discussed the episteme and techne of bioinformatics, supply us a general view and understand of bioinformatics. I am attracted by the author’s logic, in particular, David explained bioinformatics and computational biology as the science informs the tools and the tools enable the science. Actually, I am very agree with his explains of broadly bioinformatics in this paper, sciences and tools is one in the explore of life sciences, or any other discipline.
The citation of this PLoS publication:
Searls DB (2010) The Roots of Bioinformatics. PLoS Comput Biol 6(6): e1000809. doi:10.1371/journal.pcbi.1000809.
For the PDF link of this paper, click here.

Posted in Bioinformatics, Journal Club | Tagged , | Leave a comment

Convert bowtie map file into bed and wig format

Bowtie is a great software for mapping the reads produced by next generation sequencing into genome, however, in many cases, we need to convert its map out into bed or wig format to feed orther applications. Here we supply a simple perl script can doing this, please enjoy!


#!/usr/bin/perl -w
# Filename: map2bedwig.pl
# Convert bowtie map file into *.bed and *.wig file
# Note: reads are in fasta format as follow:
# >seqid1_x56
# CAGATCGATCGATCGAGCGCGACT
# >seqid2_x48
# GTCGATGAGCGACCGCGCGCGCGG
# Here 56 and 48 after x are the reads numbers.
# qinhu.wang (at) gmail.com

use strict;

my $usage = ”
Usage: $0
MAP: bowtie map file without –suppress
SEP: ID Seperator of reads_id and reads_num\n”;

print $usage;

my $map = $ARGV[0] || “bowtie.map”;
my $sep = $ARGV[1] || “_x”;
my %bed = ();
my %wig = ();

open (MAP, $map) or die “Cannot open file $map: $!\n”;
while (<MAP>
) {
my ($index, $strand, $refseqid, $start, $read, $discard) = split /\t/, $_, 6;
my ($idx, $num) = split /$sep/, $index;
my $end = $start + length($read);
$bed{$index} = $refseqid . “\t” . $start . “\t” . $end . “\t” . $strand;
$wig{$index} = $refseqid . “\t” . $start . “\t” . $end . “\t” . $num;
}
close MAP;

open (BED, “>map.bed”) or die “Cannot create file map.bed: $!\n”;
open (WIG, “>map.wig”) or die “Cannot create file map.wig: $!\n”;
foreach (sort keys %bed) {
print BED $bed{$_}, “\n”;
print WIG $wig{$_}, “\n”;
}
close BED;
close WIG;

__END__

Posted in Bioinformatics | Tagged , , , , , | 1 Comment

[Nature Biotechnology] Compare of the four multiple genome sequences alignmet tools

Pairwise sequence alignment is almost a solved problem, while multiple sequence alignment tools are still need to be improved. When sequence alignments are applied to genome-wide, it become more difficult. However, there are several tools available now. Xiaoyu Chen & Martin Tompa compared TBA, MAVID, MLAGAN and Pecan in 28 vertebrates, and the result indicate that the performance specially in non-coding regions and distantly related species are under challenge.

The original ABSTRACTS of the nature biotechnology paper:
Multiple sequence alignment is a difficult computational problem. There have been compelling pleas for methods to assess whole-genome multiple sequence alignments and compare the alignments produced by different tools. We assess the four ENCODE alignments, each of which aligns 28 vertebrates on 554 Mbp of total input sequence. We measure the level of agreement among the alignments and compare their coverage and accuracy. We find a disturbing lack of agreement among the alignments not only in species distant from human, but even in mouse, a well-studied model organism. Overall, the assessment shows that Pecan produces the most accurate or nearly most accurate alignment in all species and genomic location categories, while still providing coverage comparable to or better than that of the other alignments in the placental mammals. Our assessment reveals that constructing accurate whole-genome multiple sequence alignments remains a significant challenge, particularly for noncoding regions and distantly related species.
For the PDF link of this paper, click here.

Posted in Bioinformatics, Journal Club | Tagged , , | Leave a comment

How to read and write a sequence in fasta format by Perl?

Well, fasta format is defined as a sequence begin with “>”, followed by its identifier in the first line, and the sequence itself is stored in other lines. The sequence can be multiple and it is one of the most popular sequence format in bioinformatics. For large scale sequence analysis, the basic thing is to read and write sequence in fasta format, here we supply the basic way to aid your daily work, note it is not the only way!

Read sequence in fasta format:

sub read_fasta {
my(@fasta_file) = @_;
# Declare and initialize variables
my $sequence = '';
foreach my $line (@fasta_file) {
# discard blank line
if ($line =~ /^\s*$/) {
next;
# discard comment line
} elsif($line =~ /^\s*#/) {
next;
# discard fasta header line
} elsif($line =~ /^>/) {
next;
# keep line, add to sequence string
} else {
$sequence .= $line;
}
}
# remove non-sequence data (in this case, whitespace) from $sequence string
$sequence =~ s/\s//g;
return $sequence;
}

Write sequence in fasta format:

sub write_fasta {
my($sequence, $length) = @_;
# Print sequence in lines of $length
for ( my $pos = 0 ; $pos < length($sequence) ; $pos += $length ) {
print substr($sequence, $pos, $length), "\n";
}
}

Posted in Bioinformatics | Tagged , , | 6 Comments