I'm an M.Sc. student (grad in Aug 2010) in Biotechnology at Uppsala University, focusing on Semantic Web, Bioinformatics and Systems Biology. I'm also a MediaWiki and Drupal geek, more or less =D.

This blog is currently used for documentation of the following projects:

-- Samuel Lampa - firstname.lastname@gmail.com

Automatically abbreviate authors' first names in BibTeX

I had all my references (50+) with all author names spelled out, but was told to abbreviate first names to one letter (indeed, that looks better IMO too). I didn't want to do that manually, and found out that it can be done automatically by using the BibTeX style "abbrv". So I set it in my LaTeX document with

\bibliographystyle{abbrv}

GSoC Project accepted

Just got to know that my proposed GSoC 2010 project: "General RDF export/import in Semantic MediaWiki" (as documented here), was accepted! That's some good news! :). Mentor will be Denny Vrandečić from the Semantic MediaWiki community.

Surely the project is going to be a challenge, but it is a highly motivating one, so I'm very much looking forward to it: to hopefully solving things together with my mentor and the community, and to learning a lot.

I posted a (slightly shortened) copy of the project proposal and my bio here.

The project will be continuously documented here on this blog, so keep an eye here if you are interested (Use the GSoC2010 tag to filter out relevant posts). Community discussion will likely happen at the SMW-Devel mailing list, and if you want to contact me directly, you can do that at samuel dot lampa at gmail dot com or skype samuel_lampa.

My current status/schedule is:

  • This week: Very busy, finishing thesis report.
  • Next 2 weeks (though starting a little this week): Get dev. environment up and running (leaning towards Eclipse with PDT) and look at code
  • 12/5: Briefing with Denny
  • Up until 9th: Very busy period with exams on 24/5 and 8/6.
  • On 9th: Start coding! (So the coding start will be a little delayed, but I will make up for that, no worries! :) Not too used to having spare time anyway.)

GSoC Proposal: "General RDF export/import in Semantic MediaWiki"

This is a slightly shortened version of the full proposal, initially posted on my user page on MediaWiki.org, and then in final form on the GSoC app site.

Browse semantic data in parallel

The semantic web field has seemed quite void of successful general user interfaces for browsing semantic data in an efficient way (SPARQL querying is not really for everybody and their aunt). An interesting approach is Freebase Parallax, which lets you continuously view sets of data while you narrow down or extend your search criteria, thus "browsing in parallel". Seems to make a lot of sense in its simplicity.

3rd Project Update (Integrating SWI-Prolog for Semantic Reasoning in Bioclipse)

I just had my 3rd and last project update presentation (before the final presentation on April 28th), presenting results from comparing the performance of the integrated SWI-Prolog against Jena and Pellet for a spectrum similarity search query. Find the slides below.

On larger datasets and more peaks in the query, Prolog is the fastest

In a previous performance test I compared an NMR spectrum similarity search programmed in Prolog and run in SWI-Prolog integrated in Bioclipse, against a SPARQL query doing the same task, run in Jena and in Pellet, both also integrated in Bioclipse.

In that test Jena was the fastest, outperforming Prolog. The test had flaws though, as reported here with new tests added, though these new results were still not giving a clear winner.

This changed when I turned to larger datasets, testing in the range 2000-25000 spectra instead of 10-100 (25000 spectra corresponds to around a million RDF triples). That is a lot closer to the size of all spectra in NMRShiftDB (36419 searchable spectra as of March 29, 2010). I also used 16 instead of 3 peak shift values in the search query. When testing with these changes, Prolog clearly took the lead.

(Bioclipse scripts, including the SPARQL Query, and the Prolog code, attached. See below).

This time I also included tests with Jena using both the in-memory RDF store and the disk-based Jena TDB store. Interestingly, Jena run against the disk-based TDB store is about twice as fast as against the in-memory store! (I had to exclude the Pellet/TDB combination, as it hadn't completed after more than an hour, at which point I stopped the run.)

To be noted, in these tests I also took more measures to avoid memory and/or caching related bias. For example, I did not rerun tests shortly after each other in a loop, but restarted Bioclipse after each run. This kind of testing took a lot more time, and I have so far only had time to do three iterations for each dataset size. Error bars indicating the standard deviation are included, though, to give an idea of the variation among the three. Hopefully I will have time to complement with a few more runs per measurement point.
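As a rough illustration of how an error bar for three runs could be computed, here is a minimal Python sketch. The timings are made-up placeholder values, not measurements from these tests, and the use of the sample standard deviation (rather than the population one) is an assumption, since the choice is not stated above.

```python
from statistics import mean, stdev

# Hypothetical timings (seconds) for three runs at one dataset size.
# Placeholder values only, not results from the tests described above.
runs = [12.0, 15.0, 18.0]

avg = mean(runs)      # arithmetic mean of the three runs
spread = stdev(runs)  # sample standard deviation (n - 1 in the denominator)

print(f"mean = {avg:.1f} s, std dev = {spread:.1f} s")
```

With only three runs per point the standard deviation is a coarse estimate, which is why a few more runs per measurement point would help.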

See figures below for results. I've included one figure where Pellet is included, and one where it is skipped, in order to get a zoom in on the Jena/Prolog difference.

 

...and, Pellet excluded:

 

Correction of flawed results: Close competition between Jena and Prolog

UPDATE 29/3: See new results here

I reported in a previous blog post (with a bit of surprise) that Jena clearly outperformed SWI-Prolog for an NMR spectrum similarity search run inside Bioclipse. I have now realized that these previous results were indeed flawed, for a number of reasons.

Semantic web for LAMP systems

Just acquainted myself a tiny bit with ARC, RDF Classes for PHP. It might be useful in the Bioclipse / Semantic MediaWiki integration that is in the thoughts for the coming months. ARC is the "framework", or RDF API, used by RDF modules in Drupal, and has been talked about as a substitute for the currently used RAP framework in Semantic MediaWiki (used if one wants to set up a SPARQL endpoint for a Semantic MediaWiki).

On a different note, there are a bunch of nice web apps built on ARC over at semsol.com, most notably trice, a semantic web application framework.

Querying multiple SPARQL endpoints from a single query, with Jena SERVICE extension

Egon pointed to an interesting blog post about a feature that is available as an extension to Jena, the semantic web framework available in Bioclipse. It makes it very easy to query multiple SPARQL endpoints from a single SPARQL query (using the SERVICE keyword), and to use variables bound from one endpoint when querying the next.

This is very useful in general. I was also thinking of the specific scenario (along the lines we have partly already been thinking) to use multiple Semantic MediaWikis as community maintained databanks, for querying back into Bioclipse. Being able to use multiple MediaWiki installs is very useful because it is hard to incorporate a very efficient access restriction system in MediaWiki (due to the nature of how it works, with template calls and all), so then it is better to be able to have separate wikis for content which needs special restrictions.
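To give an idea of what such a query could look like, here is a small Python sketch that just assembles the SPARQL text. The two endpoint URLs, the ex: vocabulary and the helper name federated_query are all hypothetical placeholders; only the SERVICE keyword itself comes from the feature described above.

```python
# Hypothetical endpoints, e.g. two separate Semantic MediaWiki installs.
WIKI_A = "http://wiki-a.example.org/sparql"
WIKI_B = "http://wiki-b.example.org/sparql"


def federated_query(first_endpoint: str, second_endpoint: str) -> str:
    """Build a SPARQL query in which ?compound, bound at the first
    endpoint, is reused in the pattern sent to the second endpoint."""
    return f"""
PREFIX ex: <http://example.org/onto#>
SELECT ?compound ?spectrum
WHERE {{
  SERVICE <{first_endpoint}> {{
    ?compound ex:hasName "caffeine" .
  }}
  SERVICE <{second_endpoint}> {{
    ?spectrum ex:ofCompound ?compound .
  }}
}}
""".strip()


query = federated_query(WIKI_A, WIKI_B)
print(query)
```

The resulting string would then be handed to whatever SPARQL engine is in use; how that is done inside Bioclipse is not shown here.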

Need to add "colon dash" before rdf_register_ns

I could not get the rdf_db:rdf_register_ns/2 predicate of SWI-Prolog's semweb package to work ... getting errors like "No permission to redefine static method ..." etc.

Now I finally figured out one has to add ":-" before the line with the rdf_db:rdf_register_ns predicate, like so:

:- rdf_db:rdf_register_ns(nmr, 'http://www.nmrshiftdb.org/onto#').

or ... inside Bioclipse:
swipl.loadPrologCode(":- rdf_db:rdf_register_ns(nmr, 'http://www.nmrshiftdb.org/onto#').");