2014-10-31

BLAST! No frequency ratios needed for composition-based statistics

While updating the NCBI BLAST+ wrapper for Galaxy to cover the new BLAST+ 2.2.30 release, I hit a cryptic error message from deltablast:

$ deltablast -query rhodopsin_proteins.fasta -subject four_human_proteins.fasta -evalue 1e-08 -outfmt "6 qseqid sseqid score" -rpsdb /data/blastdb/cdd_delta
BLAST engine error: /data/blastdb/cdd_delta contains no frequency ratios needed for composition-based statistics.
Please disable composition-based statistics when searching against /data/blastdb/ncbi/cdd/cdd_delta.

To cut a long story short, the fix is to download and unpack a newer cdd_delta.tar.gz, which now includes an extra file, cdd_delta.freq, containing the frequency ratio information that the newer deltablast tool requires.

The same applies to the rpsblast tool, although here you just get a warning rather than an error:

$ rpsblast -query four_human_proteins.fasta -db /data/blastdb/cdd_delta -evalue 1e-08 -outfmt "6 qseqid sseqid score"
Warning: /data/blastdb/cdd_delta contain(s) no freq ratios needed for composition-based statistics.
RPSBLAST will be run without composition-based statistics.
sp|Q9BS26|ERP44_HUMAN    gnl|CDD|222416    401
...
sp|P06213|INSR_HUMAN    gnl|CDD|238021    137
sp|P08100|OPSD_HUMAN    gnl|CDD|215646    411
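
Note that the two tools word the complaint differently ("frequency ratios" versus "freq ratios"), which matters if you are scanning stderr in a wrapper or test script. Here is a minimal sketch of how both wordings could be caught; the function name has_freq_ratio_problem is my own invention, and it assumes stderr has already been captured to a file:

```shell
# Hypothetical helper: succeed if a captured BLAST+ stderr log mentions
# the missing frequency ratios problem, in either tool's wording.
has_freq_ratio_problem() {
    # -E: extended regex, -i: ignore case, -q: quiet (exit status only)
    grep -Eiq 'no freq(uency)? ratios needed for composition-based statistics' "$1"
}
```

For example, after redirecting a run's stderr to blast.err, something like `has_freq_ratio_problem blast.err && echo "cdd_delta needs updating"` would flag the problem whether it surfaced as an error or as a warning.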

For the full story: I am using two small sample files, rhodopsin_proteins.fasta and four_human_proteins.fasta, as test cases. With BLAST+ 2.2.26 through 2.2.29, this example worked:

$ ~/ncbi_blast_2.2.29+/deltablast -query rhodopsin_proteins.fasta -subject four_human_proteins.fasta -evalue 1e-08 -outfmt "6 qseqid sseqid score" -rpsdb /data/blastdb/cdd_delta
gi|57163783|ref|NP_001009242.1|    sp|P08100|OPSD_HUMAN    826
gi|3024260|sp|P56514.1|OPSD_BUFBU    sp|P08100|OPSD_HUMAN    767
gi|283855846|gb|ADB45242.1|    sp|P08100|OPSD_HUMAN    718
gi|283855823|gb|ADB45229.1|    sp|P08100|OPSD_HUMAN    721
gi|223523|prf||0811197A    sp|P08100|OPSD_HUMAN    842
gi|12583665|dbj|BAB21486.1|    sp|P08100|OPSD_HUMAN    795

The error message from BLAST+ 2.2.30 was a bit cryptic, but it suggested the domain database format had changed. I was using quite an old copy of the cdd_delta database from November 2013, so I downloaded the current version of cdd_delta.tar.gz (dated 24 Oct 2014; MD5 checksum 0a5513e147aa320264a1414f8194cfbc, verified against cdd_delta.tar.gz.md5).
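
For reference, the verify-then-unpack step could be sketched roughly as follows. This is not a record of the exact commands I ran: the function name verify_and_unpack and the example paths are hypothetical, and it assumes GNU md5sum, a .md5 file in the usual "checksum  filename" layout sitting next to the tarball, and an archive that unpacks its files directly into the target directory.

```shell
# Hypothetical helper: verify a downloaded tarball against its .md5 file,
# then unpack it into the BLAST database directory.
verify_and_unpack() {
    tarball=$1      # e.g. /downloads/cdd_delta.tar.gz (illustrative path)
    dest=$2         # e.g. /data/blastdb
    dir=$(dirname "$tarball")
    name=$(basename "$tarball")
    # md5sum -c looks up the listed filename relative to the current
    # directory, so run the check from the directory holding the tarball.
    ( cd "$dir" && md5sum -c "$name.md5" ) || return 1
    # Only unpack once the checksum has matched.
    mkdir -p "$dest" && tar -xzf "$tarball" -C "$dest"
}
```

The checksum test runs in a subshell so the caller's working directory is left alone, and a mismatch stops the function before anything is written to the database directory.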

Now deltablast from BLAST+ 2.2.30 works, although the scores (and other details of the alignments) are slightly different:

$ ~/ncbi_blast_2.2.30+/2.2.30+/deltablast -query rhodopsin_proteins.fasta -subject four_human_proteins.fasta -evalue 1e-08 -outfmt "6 qseqid sseqid score" -rpsdb cdd_delta
gi|57163783|ref|NP_001009242.1|    sp|P08100|OPSD_HUMAN    822
gi|3024260|sp|P56514.1|OPSD_BUFBU    sp|P08100|OPSD_HUMAN    759
gi|283855846|gb|ADB45242.1|    sp|P08100|OPSD_HUMAN    714
gi|283855823|gb|ADB45229.1|    sp|P08100|OPSD_HUMAN    718
gi|223523|prf||0811197A    sp|P08100|OPSD_HUMAN    839
gi|12583665|dbj|BAB21486.1|    sp|P08100|OPSD_HUMAN    790

So what changed? The new database contains an extra file, cdd_delta.freq. So for anyone else stumped by the error message "BLASTDB contains no frequency ratios needed for composition-based statistics": check whether a file named BLASTDB.freq is present, and if it is not, update the database.
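
That check is easy to script. A minimal sketch, assuming you pass the full path prefix of the database (a bare name like cdd_delta resolved through BLAST's own search path would need expanding first); the function name check_freq_file is just for illustration:

```shell
# Hypothetical helper: report whether a domain database prefix
# (e.g. /data/blastdb/cdd_delta) has its .freq file alongside it.
check_freq_file() {
    db=$1
    if [ -f "$db.freq" ]; then
        echo "OK: $db.freq found"
    else
        echo "MISSING: $db.freq (try a newer copy of the database)" >&2
        return 1
    fi
}
```

Running `check_freq_file /data/blastdb/cdd_delta` before a deltablast or rpsblast job gives a much clearer hint than the BLAST+ message does.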
