Update May 11: Daniel Spångberg now provided a few even more optimized versions: DS Tbl: a table optimized version of the line-by-line reading, as well as optimized versions of the ones that read the whole file at once. On the fastest one, I also tried with -funroll-all-loops to gcc, to see how far we could get. The result (for a 58MB text file): 66 milliseconds!
Update May 16: Thanks to Franzivan Bezerra's post here, I now got some Go code here as well. I modified Francivan's gode a bit, to comply with our latest optimizations here, as well as created a table optimized version (similar to Daniel Spångberg's Table optimized C code). To give a fair comparison of table optimization, I also added a table optimized D version, compiled with DMD, GDC and LDC. Find the results in the first graph below.
Update II, May 16: The link to the input file, used in the examples, got lost. It can be downloaded here!
Update III, May 16: Thanks to Roger Peppe, suggesting some improved versions in this thread on G+, we now got two more, very fast Go versions, clocking even slightly below the idiomatic C++ code. Have a look in the first graph and table below!
Update IV, May 16: Now applied equivalent optimizations like Roger's ones for Go mentioned above, to the "D Tbl" versions (re-use the "line" var as buffer to readln(), and use foreach instead of for loop). Result: D and Go gets extremely similar numbers: 0.317 and 0.319 was the best I could get for D and Go, respectively. See updated graphs below.
Update May 17: See this post on how Francivan Bezerra acheived same performance as C with D, beating the other two (Go and Java) in the comparison. (Direct link to GDoc)
My two previous blog posts (this one and this one) on comparing the performance of a number of languages for a simple but very typical bioinformatics problem, spurred a whole bunch of people to suggest improvements. Nice! :) Many of the comments can be read in the comments on those posts, as well as in the G+ thread, while I got improvements sent by mail from Thomas Hume (thomas.hume -at- labri.fr) and Daniel Spångberg.
Anyway, here comes an updated post, with a whole array of improvements, and a few added languages.
One suggestion I got (by Thomas Koch, I think), was applicable to all the code examples that I got sent to me, namely to count GC, and AT separately, and add to the total sum only in the end. It basically cut a few percent of all execution times, so I went through all the code variants and added it. It did improve the speed in a simipar way for all the codes.
Many people suggested to read the whole file in once instead of reading line by line though. This can cause problems with the many very large files in bioinformatics, so I have divided the results below in solutions that read line by line, and those that read more than that.
Among the code examples I got, was "our own" Daniel Spångberg of UPPMAX, who wrote a whole bunch of C variants, even including an OpenMP parallelized variant, which takes the absolute lead. Nice! :)
More credits in the explanation of the codes, under each of the results sections.
Ok, so let's go to the results:
Here are first the codes that read only line by line. I wanted to present them first, to give a fair comparison.
| Python | 1.692 |
|---|---|
| Cython | 1.684 |
| Go (8g) | 1.241 |
| Go (gcc) | 0.883 |
| Go Tbl (gcc) | 0.785 |
| Go Tbl (8g) | 0.767 |
| PyPy | 0.612 |
| FPC (MiSchi) | 0.597 |
| D (GDC) | 0.589 |
| FPC | 0.586 |
| FPC (leledumbo) | 0.567 |
| D (DMD) | 0.54 |
| D Tbl (DMD) | 0.492 |
| D (LDC) | 0.479 |
| C++ (HA) | 0.385 |
| D Tbl (GDC) | 0.356 |
| Go Tbl (RP) (8g) | 0.337 |
| Go Tbl (RP) Ptr (8g) | 0.319 |
| D Tbl (LDC) | 0.317 |
| C (DS) | 0.276 |
| C (TH) | 0.252 |
| C (DS Tbl) | 0.183 |
All of the below codes have have been modified to benefit from the tip from Thomas Koch (thomas -at- koch.ro), to cound AC and GC separately in the inner loop, and sum up only in the end.
Click the language code to see the code!
Here are the codes that don't keep themselves to reading line by line, but do more than so. For example Gerdus van Zyl's pypy-optimized code reads a bunch of lines at a time (1k seems optimal), before processing them, while still allowing to process in a line-by-line fashion. Then there is Daniel Spångberg's exercise in C performance, where by reading the file in at once, and applying some smart C tricks, and even doing some OpenMP-parallelization, cuts down the execution time under 100 ms, and that is for a 1million line file, on a rather slow MacBook air. Quite impressive!
| PyPy (GvZ) | PyPy | PyPy (GvZ) + Opt | C (DS Whole) | C (DS Whole Tbl) | C (DS Whole Tbl Par) | C (DS Whole Tbl 16bit) | C (DS Whole Tbl Par 16bit) | C (DS Whole Tbl Par 16bit Unroll) |
|---|---|---|---|---|---|---|---|---|
| 0.862s | 0.626s | 0.551s | 0.178s | 0.114s | 0.095s | 0.084s | 0.070s | 0.066s |
Note: All of the below codes have been modified to benefit from the tip from Thomas Koch (thomas -at- koch.ro), to cound AC and GC separately in the inner loop, and sum up only in the end.
I think there are a number of conclusions one can draw from this:
What do you think?
Update May 16: More optimizations and languages have been added here, and here (last one includes Python, D, FreePascal, C++, C and Go).
I'm testing to implement some simple bioinformatics algos in both python and D, for comparison.
My first test, of simply calculating the GC content (fraction of DNA letters G and C, as compared to G, C, T and A), reveals some hints about where D shines the most.
Based on this very limited test, D:s strength in terms of performance, seems to be more clear when you hand-code your inner loops, than when using some of the functionality of the standard library (phobos).
When using the countchars function in std.string of D:s standard library, D is on par with, but a tiny bit slower than python. When I hand-code the inner loop more carefully though, I get a 10x speedup of the D code, over my python code.
As test data, I use a 58MB Fasta file, containing approximately 1 million (989561) lines of DNA sequence.
As a reference, I wrote this little python script. Probably there is a faster way to write it, but this is the fastest I could come up with.
import re import string def main(): file = open("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa","r") gcCount = 0 totalBaseCount = 0 for line in file: line = line.strip("\n") if not line.startswith(">"): gcCount += len(re.findall("[GC]", line)) totalBaseCount += len(re.findall("[GCTA]", line)) gcFraction = float(gcCount) / totalBaseCount print( gcFraction * 100 ) if __name__ == '__main__': main()
This takes around 8 seconds on my machine:
[samuel gc]$ time python gc_test.py 37.6217301394 real 0m7.904s user 0m7.876s sys 0m0.020s
Then, I tried two implementations in D. First I tried to make my life as simple as possible, by using existing standard library functionality (the countchars function in std.string):
import std.stdio; import std.string; import std.algorithm; import std.regex; void main() { File file = File("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa","r"); int gcCount = 0; int totalBaseCount = 0; string line = ""; while (!file.eof()) { line = chomp(file.readln()); if (!startsWith(line, ">")) { gcCount += countchars(line, "[GC]"); totalBaseCount += countchars(line, "[GCTA]"); } } float gcFraction = ( cast(float)gcCount / totalBaseCount ); writeln( gcFraction * 100 ); }
The results: slightly slower than python with DMD, and slightly faster with GDC and LDC:
[samuel gc]$ dmd -ofgc_slow_dmd -O -release -inline gc_slow.d
[samuel gc]$ gdmd -ofgc_slow_gdc -O -release -inline gc_slow.d
[samuel gc]$ ldmd2 -ofgc_slow_ldc -O -release -inline gc_slow.d
[samuel gc]$ for c in gc_slow_{dmd,gdc,ldc}; do echo "---"; echo "$c:"; time ./$c; done;
---
gc_slow_dmd:
37.6217
real 0m8.088s
user 0m8.049s
sys 0m0.032s
---
gc_slow_gdc:
37.6217
real 0m6.791s
user 0m6.764s
sys 0m0.020s
---
gc_slow_ldc:
37.6217
real 0m7.138s
user 0m7.096s
sys 0m0.036sOn the other hand, when I hand code the string comparison code, like this:
import std.stdio; import std.string; import std.algorithm; import std.regex; void main() { File file = File("Homo_sapiens.GRCh37.67.dna_rm.chromosome.Y.fa","r"); int gcCount = 0; int totalBaseCount = 0; string line; char c; while (!file.eof()) { line = file.readln(); if (line.length > 0 && !startsWith(line, ">")) { for(uint i = 0; i < line.length; i++) { c = line[i]; if (c == 'C' || c == 'T' || c == 'G' || c == 'A') { totalBaseCount++; if ( c == 'G' || c == 'C' ) { gcCount++; } } } } } float gcFraction = ( cast(float)gcCount / cast(float)totalBaseCount ); writeln( gcFraction * 100 ); }
... then I get some significant speedup over python, around 10x (13x with the fastest combination of compiler and optimization flags)!:
[samuel gc]$ dmd -ofgc_fast_dmd -O -release -inline gc_fast.d
[samuel gc]$ gdmd -ofgc_fast_gdc -O -release -inline gc_fast.d
[samuel gc]$ ldmd2 -ofgc_fast_ldc -O -release -inline gc_fast.d
[samuel gc]$ for c in gc_faster_{dmd,gdc,ldc}; do echo "---"; echo "$c:"; time ./$c; done;
---
gc_faster_dmd:
37.6217
real 0m0.652s
user 0m0.624s
sys 0m0.024s
---
gc_faster_gdc:
37.6217
real 0m0.668s
user 0m0.644s
sys 0m0.020s
---
gc_faster_ldc:
37.6217
real 0m0.538s
user 0m0.520s
sys 0m0.016sFor you next gen sequencing bioinformaticians interested in getting some hands on new cloud based technologies for computation, mainly around the hadoop framework, and being around in Uppsala at the end of May 2012, may want to have a look at this:
As the event page on UPPMAX states:
"The hackathon will focus on next challenges that cloud adoption poses: massively distributed data processing frameworks such as Hadoop, distributed cloud databases and distributed bioinformatics applications."
Based on my experiences from a very useful (as in getting new hands on experience) and interesting (as in making new contacts) hackathon at CSC in Helsinki, Finland, I am sure this will be a highly interesting and useful hackathon as well, for all who are faced with the challenges of big sequencing data.
Apply before April 30 to get (EU/COST action: SeqAhead) funding! See you in Uppsala at the end of May!
This blog has been silent for a while and someone might wonder what I've been doing.
One answer is: Developing a graphical client for non-linux-experienced users to connect securely to a computer cluster and configure batch jobs for common bioinformatics software. The project is financed within the UPPNEX project, and so the focus is foremost analysis of Next Generation Sequencing data, but the client will be fully capable to use for any software installed on the cluster.
The client meets a rising need in the next generation sequencing community, since biologists generally have far less experience with *nix systems and programming, than, say physicists, while the vast amounts of sequencing data increases the need to use large scale computing resources such as the ones provided in the UPPNEX project.
I demonstrated a proof-of-concept version at the UPPNEX-SciLifeLab bioinformatics forum on Feb 22nd, and the slides are now available:
As can be seen in the slides, the client is based on the very capable Bioclipse platform.
Dr. Reinhard Schneider from the European Molecular Biology Laboratory held a lecture at BMC in Uppsala with the title seen above. It seemed quite relevant to the stuff I'm currently doing at Science for Life Laboratory (where I'm employed for 2 months), investigating LIMS systems for NextGen sequencing data, as well as learning about analysis tools in the area.
What Reinhard presented was four different tools that they have developed/are working on, which tries to solve some of the problems of grasping heterogenous data sources. From the lecture info:
Recent comments
1 week 16 hours ago
1 week 5 days ago
1 week 6 days ago
1 week 6 days ago
1 week 6 days ago
1 week 6 days ago
1 week 6 days ago
1 week 6 days ago
1 week 6 days ago
1 week 6 days ago