Bioinformatician / Developer at UPPMAX HPC Center, Uppsala University. Interested in Semantic Web, Bioinformatics and Computational Systems Biology, Java, Python, Domain Driven Design, Model Driven Development, MediaWiki and Drupal.

This blog is currently mostly filled with documentation of two past projects:

-- Samuel Lampa - firstname.lastname@gmail.com

SMWCon Fall 2011 impressions

(I had this post in draft for too long. Time to publish, as is)

The Semantic MediaWiki conference Fall 2011 in Berlin is over, so time to summarize some thoughts and impressions.

It was my first SMWCon ever. A bit late, considering that I did a Google Summer of Code project for SMW in 2010, but my finances were more or less nonexistent then. Happy to get the chance now, though.

Before turning to some of the individual talks, just a note on two general things I found interesting (I wish I had time to review every talk, since there was so much interesting stuff...):

  1. There was a remarkable number of talks on connecting SMW with the rest of the Semantic Web, through RDF, SPARQL etc. Cool, SMW is seemingly becoming a natural choice of platform for SemWeb publishing!
  2. The proportion of bio-people was also a bit remarkable. Apart from SNPedia founder Mike Cariaso, there was a whole bunch of others, including Salvadore from the GeneWiki project, Toni .... (and me) .... I guess it reflects how well SMW handles the need to give structure to very heterogeneous datasets, so typical for the Life Sciences.

The talks

Note that you can now find slides and videos for most of the talks, online:

The conference started with one tutorial day, followed by the main days for talks. The tutorials turned out to be so interesting, though, that most people seemed to attend them as well.

Find below my very brief notes/impressions on some talks that I found especially interesting, for my use cases and interests:

Very nice UIs

Daniel Hansch from OntoPrise showed off their SMW+ Community Edition package, which includes SMW, Halo and other extensions. This is quite cool stuff, with really helpful and slick UIs, so let's hope it will remain open source! :) (Slides, Video)

Performance optimizations in SMW

Markus Krötzsch talked on "Saving CO2: Top SMW Performance Issues and How to Address Them". The slides are crammed with links to more detailed information, so this I'll have to study in more detail. (Slides, Video)

SMW reworkings for better RDF support

As said, there were lots of RDF talks at the conference. One of them covered some reworkings of the SMW internals to support the RDF and SPARQL models better: Markus Krötzsch gave an overview of the relationship between SMW and RDF in his talk "Connecting SMW to RDF Databases: Why, What, and How?" (Slides)

Keeping track of changes, even at the fact level!

Jeroen De Dauw presented a new extension: "Semantic Watchlist", to replace a number of (in Jeroen's opinion, somewhat lacking) extensions. Looks very well done! (Slides, Video)

Powerful on-demand transformation of RDF data

William Smith, Christian Becker and Andreas Schultz presented some very cool stuff: "Neurowiki: How we integrated large datasets into SMW with R2R and Silk / LDIF". The LDIF framework seems to do a lot of what RDFIO does, but in a much bigger and more capable framework. Really cool stuff! (Slides 1, Slides part 2, Video)

SMW classes as XML

Yaron Koren presented an idea to store "classes" in SMW as XML/JSON in one single location, rather than as now, in three different places (Category, Template and Form). (Slides, Video)

Towards a Semantic Wikipedia!

Denny Vrandečić and Daniel Kinzler presented the WikiData extension, as part of their work to make a "Semantic Wikipedia" a reality ... and based on another nice demo-project they did: Shortipedia. This is gonna be hot stuff! (Slides, Video)

Linked Data - increasingly important topic for SMW

Anja Jentzsch from the LODD presented Linked Data, and the best practices for how to publish RDF data, so that you really "get connected" to the evolving Semantic Web. An increasingly important topic, as shown by the increased interest in connecting SMW with the outside world. (Slides, Video)

RDFIO / Hooking up SMW with external tools via SPARQL

My talk ... (as blogged earlier)

More RDF to Wiki Page title mapping strategies

Michael Erdmann also presented how they do data integration with SMW+ and OntoBroker. Interestingly, they use a strategy similar to RDFIO's in order to nicify wiki page titles. Interesting! ... maybe there is a way to consolidate all these efforts in a reusable way? (Slides, Video)

Tractable reasoning

Jeff Pan talked on "Tractable Reasoning". Very interesting! They focus on "making reasoning reasonable", that is, computationally feasible ... and seem to have succeeded as well: their REL reasoner has been shown to totally outperform reasoners such as Pellet for more or less any kind of ontology, cool! (I found out earlier that it's quite easy to beat Pellet though, with SPARQL/ARC, and especially with Prolog). (Video)

Next steps for Semantic MediaWiki

Markus Krötzsch and Jeroen De Dauw talked about the next steps for Semantic MediaWiki. Many great things happening: Foundation started, ... (Slides, Video)

Excel-like statistics in SMW

Benedikt Kaempgen demonstrated some supercool stuff, something like a pivot browser for Semantic MediaWiki, in order to get more "Excel-like" statistics in SMW. They use the Spark extension by Jeroen to query SPARQL endpoints and similar stuff from JavaScript. Supercool! (Slides, Video)

SMW as a semantic browser

Benedikt also showed off their Semantic Web browser, which only requires "Equivalent URIs" to be defined for pages, and then lets you browse the data in the wiki in its original format. Interesting, since that matches perfectly with RDFIO: RDFIO complements this with RDF export and querying in the original format, using basically the same strategy! (Slides, Video)

SNPedia

Mike Cariaso talked about what's new with SNPedia ... lots of cool stuff (apart from how cool SNPedia is just in itself!) (Video)

Lightning talks

Not to forget, there was also a whole bunch of very interesting lightning talks ... too many for me to have time to cover here now. One thing you really should not miss though, is the Spark extension by Jeroen De Dauw, to query SPARQL endpoints via JavaScript, for visualizations. Wow! (Slides).

Also, for you bio-people reading this: you'll probably find the SNPedia + GeneWiki mashup interesting! (Video)

Well, you better watch them all anyway, they are only 5 minutes each, and there's just too much good stuff there, so I can't cover it all:

My SMWCon Fall 2011 Talk now on YouTube

I blogged it earlier, but better to get everything in one post, so taking the summary again:

After doing my GSoC project for the Wikimedia Foundation / Semantic MediaWiki in 2010, which resulted in the RDFIO Extension, I could finally make it to the Semantic MediaWiki Conference, which was in Berlin in September.

Now, the video of my talk, "hooking up Semantic MediaWiki with external tools via SPARQL" (such as Bioclipse and R), is out on YouTube, so please find it below. For your convenience, you can find the slides below the video, as well as the relevant links to the different stuff shown (click "read more" to see it all on same page).

One week off to work on RDFIO

Starting this evening, I'm taking one week off from work to make a sprint to try to finalize the RDFIO extension, for RDF import and export in Semantic MediaWiki.

This will be a required step for finalizing the vision described in my SMWCon fall 2011 talk, the other month.

I developed RDFIO as part of Google Summer of Code 2010, and it got into a working proof-of-concept state. Some issues, such as performance, were never resolved though. It also depends on two other modules which, after Semantic MediaWiki changed a lot of its internals in version 1.6, have still not been updated to support these changes, leaving RDFIO in a state where it does not support SMW 1.6.

So, a little sprint is definitely needed to get RDFIO into working condition again. At the same time I hope to look at it with fresh eyes: after a year of quite a lot of Java and Python development in my work at UPPMAX, I have a lot more coding experience now than when I first wrote RDFIO.

Things I plan to have a look at (or at least ponder):

  • Look at the possibility of using the Wiki Object Model instead of the Page Object Model / SMWWriter combo (which unfortunately is not SMW 1.6 compliant anymore)
  • See if/how I can make more use of the new infrastructure in SMW, summarized by Markus in this post on the SMW-devel mailing list.
  • Take an overall fresh look at the architecture of the code ... try to follow Domain Driven Design principles much better, to get clean, maintainable code.
  • Use existing MediaWiki features much more, such as the HTMLForm form builder class (for special pages and the like).
    (More suggestions like this highly welcome!)
  • Import ARC2 library via the ARCLibrary extension rather than with a separate import.
  • Use the existing "Equivalent URI" special property, instead of the custom "Original URI" (don't remember why I created a custom one...)
  • Run big imports as jobs?
  • OWL class import (as categories) ?
  • Allow updating Wiki articles from any connected store, by using the new SMW Internals?
  • Other things?

I would love to get some feedback and input to the project during this intensive week, so don't hesitate to drop in at #semantic-mediawiki on irc.freenode.net (IRC chat) or in the SMW-devel mailing list! My contact options, summarized:

Looking forward to your input during this week!

Switched to Xubuntu

I was upgrading to Ubuntu 11.10 the other day, after sticking to the pre-Unity, Ubuntu 10.10. Now I thought the Unity stuff would have gotten better, and sure it has, but still I couldn't stand it ... so, switched to Xubuntu (sudo apt-get install xubuntu-desktop), which uses XFCE as desktop environment instead of GNOME.

And after some tweaking ... Wow! ... so snappy, so beautiful, so fast, so consistent! Just wow ... It's definitely my distro of choice from now on ...

Have to throw a screenshot at you (sorry, haven't got the Lightbox up working yet):

For your reference:

New majority voted twitter hashtag for NextGen Sequencing: #deepseq

As I concluded in a question on Biostar, there has been no real consensus on a short, non-hijacked hashtag to use for "High-Throughput Sequencing" / "Next Generation Sequencing" on social media sites such as Twitter and identi.ca.

After some community voting, a new winner turned out: #deepseq (click for twitter feed)

So, do spread the word, and start using it!

Grepping SQL dumps with endless lines? Use the fold command!

Grepping for stuff in MySQL dumps is not that nice, with miles-wide lines. You could send the grep output to a command such as "cut -c 1-200", but that would still not be guaranteed to give you the actual matched content.

Enter the "fold" command, which formats output into lines with a max count of chars:

grep "stuff" sqldump.sql | fold -w 200 | grep -C 1 "stuff"

... will give you a much better view of the context of the match!

(The first grep gets the (mile-wide) line that has the match, then fold will split the mile-wide line into 200 char long lines, and "grep -C 1" will show only the one 200 char wide line where the match is + 1 line of context before and after).
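To make the mechanics concrete, here is a minimal Python sketch of what the last two stages of that pipeline do. It is not how fold or grep are implemented: the helper names fold and grep_context and the fake dump line are made up for illustration, and the sketch counts characters where the real fold counts columns.

```python
import re


def fold(line, width):
    """Split one long line into chunks of at most `width` characters (like `fold -w`)."""
    return [line[i:i + width] for i in range(0, len(line), width)] or [""]


def grep_context(lines, pattern, context=1):
    """Return matching lines plus `context` lines before and after (like `grep -C`)."""
    regex = re.compile(pattern)
    keep = set()
    for idx, text in enumerate(lines):
        if regex.search(text):
            first = max(0, idx - context)
            last = min(len(lines), idx + context + 1)
            keep.update(range(first, last))
    return [lines[i] for i in sorted(keep)]


# A made-up, mile-wide "dump line", for illustration only.
dump_line = "INSERT INTO t VALUES " + ",".join(f"({n},'row{n}')" for n in range(200))

chunks = fold(dump_line, 40)                       # fold -w 40
hits = grep_context(chunks, r"row150'", context=1) # grep -C 1 "row150'"
for chunk in hits:
    print(chunk)
```

One caveat that follows from how fold works: a match that happens to straddle a 200 character boundary gets split over two lines, and the second grep will then miss it. Re-running with a slightly different width (say -w 190) usually gets around that.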


My SMWCon Fall 2011 Talk

After doing my GSoC project for the Wikimedia Foundation / Semantic MediaWiki in 2010, which resulted in the RDFIO Extension, I could finally make it to the Semantic MediaWiki Conference, which was in Berlin this week.

While I write up a longer review of the many interesting talks, you can in the meantime find the slides from my talk below, on "hooking up Semantic MediaWiki with external tools" (such as Bioclipse and R):

Links

  • For the SMW/Bioclipse Hookup there is a status update on my blog.
  • ... with a demo screencast.
  • More info on the RDFIO extension available on the Extension page
  • Code for the Bioclipse SMW module is available at github
  • Bioclipse website is at bioclipse.net
  • ... and the (SMW) Bioclipse wiki
  • The SMW/R hookup is not yet published in any journal, but this is what is available:
    • Egon Willighagen, who did it, has blogged it.
    • Also, the rrdf package he wrote is available on CRAN, and there's a PDF available, describing it.

Essential screen flags and shortcuts

GNU Screen is a nice little program that gives you "terminals" you can detach into the background. That way you can start long batch jobs that write to stdout, for example, without having to be afraid of closing your terminal by accident.

Unfortunately screen has, IMO, quite an awkward syntax, but I managed to learn three flag combinations and two keyboard shortcuts (from inside screen) that seem to be all I need for basic usage:

Flags

Start a new named screen session (it starts detached, in the background):

screen -dmS ASessionName

List all screen sessions, attached and detached:

screen -ls

Re-attach a named screen session:

screen -r ASessionName

Shortcuts

Detach the current session in background:

Ctrl + a, Ctrl + d

Close the current screen session (this exits the shell running inside it):

Ctrl + d


HPC Client Screencast: Experimental Job Config Wizard

My work at UPPMAX on the Bioclipse-based HPC client is progressing, slowly but steadily. I just screencasted an experimental version of the job configuration wizard, which loads command line tool definitions from the Galaxy workbench and uses them to generate a GUI for configuring the parameters of the command line tool in question, as well as the parameters for the Slurm resource manager (used at UPPMAX). Have a look if you want :) :

The Wizard obviously has quite some rough edges still. My current TODO is as follows:

  • Set sensible default values in widgets (e.g. when there is just one alternative)
  • Use checkboxes and radiobuttons for select fields with few options
  • Use a progress bar between wizard pages that take time to load
  • Decide how to take care of the Cheetah #if / #else / #end if syntax, available in some Galaxy tool config files.
  • Add validation
  • Use a time widget for the job time
  • Add a custom view with just a "connect" button, and showing only remote files for the configured host.
  • More modular loading of modules (hierarchical etc.)
  • More advanced parsing of options (i.e. allowing to omit params, rather than just saying "no" on them).

Etc ... More suggestions? :)

(E)BNF parser for (parts of) the Galaxy ToolConfigs with ANTLR

As blogged earlier, I'm currently into parsing the syntax of some definitions for the parameters and stuff of command line tools. As said in the linked blog post, I was pondering whether to use the Galaxy ToolConfig format or the DocBook CmdSynopsis format. Well, it turned out that CmdSynopsis lacks the option to specify a list of valid choices for a parameter. The Galaxy ToolConfig format does allow that (see here), and such a list can be used to generate drop-down lists in wizards etc., which is basically what I want to do ... so, now I'm going with the Galaxy format after all.

Enter the Galaxy format then. Look at an example code snippet:

<tool id="sam_to_bam" name="SAM-to-BAM" version="1.1.1">
  <description>converts SAM format to BAM format</description>
  <requirements>
    <requirement type="package">samtools</requirement>
  </requirements>
  <command interpreter="python">
    sam_to_bam.py
      --input1=$source.input1
      --dbkey=${input1.metadata.dbkey} 
      #if $source.index_source == "history":
        --ref_file=$source.ref_file
      #else
        --ref_file="None"
      #end if
      --output1=$output1
      --index_dir=${GALAXY_DATA_INDEX_DIR}
  </command>
  <inputs>
    <conditional name="source">
      <param name="index_source" type="select" label="Choose the source for the reference list">
        <option value="cached">Locally cached</option>
        <option value="history">History</option>
      </param>
      <when value="cached">
      ... cont ...

Here I've got some challenges. XML parsing is easy, even in Java (I use the Java XPath libs for that). But look inside the <command> tag ... that's some really non-XML stuff, no? (It is instructions for Cheetah, a Python-based template library used in Galaxy.) I have to parse this though, in order to replicate the logic of it ... so what to do? ... Well, I turned to the ANTLR Parser Generator.
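For the easy XML part, here is a minimal sketch of pulling the valid choices out of the select parameters, i.e. the data a drop-down list would be built from. The real code uses Java XPath; the sketch below is Python with the standard library's ElementTree instead, and select_choices is a made-up helper name. The XML is a trimmed copy of the snippet above, with the closing tags added by hand so that it parses.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sam_to_bam snippet; closing tags added so it is well-formed.
TOOL_XML = """
<tool id="sam_to_bam" name="SAM-to-BAM" version="1.1.1">
  <inputs>
    <conditional name="source">
      <param name="index_source" type="select" label="Choose the source for the reference list">
        <option value="cached">Locally cached</option>
        <option value="history">History</option>
      </param>
    </conditional>
  </inputs>
</tool>
"""


def select_choices(tool_xml):
    """Map each select parameter name to its list of (value, label) choices."""
    root = ET.fromstring(tool_xml.strip())
    choices = {}
    # ElementTree supports a small XPath subset, enough for this lookup.
    for param in root.iterfind(".//param[@type='select']"):
        choices[param.get("name")] = [
            (option.get("value"), (option.text or "").strip())
            for option in param.findall("option")
        ]
    return choices


print(select_choices(TOOL_XML))
```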

ANTLRWorks works nicely out of the box

I had heard a lot of good things about ANTLR, like that it is more easily debugged than typical BNF parsers etc., so the choice wasn't that hard. I tried ANTLR for Eclipse, but though it looks nice, it was quite buggy, and I couldn't get it to work properly in either Eclipse 3.5 or 3.6. So, finally I went with the easy option and developed my EBNF grammar in ANTLRWorks, which is an integrated Java app, with the correct ANTLR lib already installed etc. It turned out to work really well!

The grammar I came up with so far (only for the syntax inside the <command> tag, though!) is available on GitHub ... and below (in condensed syntax to save some space), for your convenience :)

grammar GalaxyToolConfig;
options {output=AST;}
 
command    : binary (ifstatement param+ (ELSE param+)? ENDIF | param)*;
binary     : WORD;
ifstatement 
        : IF (STRING|VARIABLE) EQTEST (STRING|VARIABLE) COLON;
param   : DBLDASH WORD* EQ (VARIABLE|STRING);
WORD    : ('a'..'z'|'A'..'Z')('a'..'z'|'A'..'Z'|'.'|'_'|'0'..'9')*;
VARIABLE 
        : '$'('{')?WORD('}')?;
STRING  : '"'('a'..'z'|'A'..'Z')+'"';
IF      : '#if';
ELSE    : '#else';
ENDIF   : '#end if';
EQ      : '=';
EQTEST  : '==';
DBLDASH : '--';
COLON   : ':';
WS      : (' '|'\t'|'\r'|'\n') {$channel=HIDDEN;};

Suggestions for improvements? :) ... Then go ahead and mail me (samuel dot lampa at gmail dot com).
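To illustrate how close the lexer rules are to plain regular expressions, here is a rough Python sketch that mirrors the token rules of the grammar with the re module. It is only an approximation under my own assumptions: ANTLR's generated lexer picks tokens by lookahead rather than by trying alternatives in order, and the tokenize helper is not part of anything ANTLR generates.

```python
import re

# One regex per lexer rule in the grammar. Order matters here, because Python's
# alternation takes the first alternative that matches (e.g. '==' before '=').
LEXER_RULES = [
    ("IF", r"#if"),
    ("ELSE", r"#else"),
    ("ENDIF", r"#end if"),
    ("VARIABLE", r"\$\{?[A-Za-z][A-Za-z0-9._]*\}?"),
    ("STRING", r'"[A-Za-z]+"'),
    ("WORD", r"[A-Za-z][A-Za-z0-9._]*"),
    ("EQTEST", r"=="),
    ("EQ", r"="),
    ("DBLDASH", r"--"),
    ("COLON", r":"),
    ("WS", r"[ \t\r\n]+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in LEXER_RULES))


def tokenize(text):
    """Return a list of (token name, token text), skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"Unexpected character {text[pos]!r} at offset {pos}")
        if match.lastgroup != "WS":  # same effect as sending WS to the hidden channel
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


sample = 'sam_to_bam.py --input1=$source.input1 #if $source.index_source == "history":'
for name, text in tokenize(sample):
    print(name, text)
```

What the regexes cannot express is the structure on top of the tokens (the command, ifstatement and param rules); that is the part where a parser generator earns its keep.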

Also, see a little screenshot from ANTLRWorks below:

ANTLRWorks Screenshot

As you can see in the screenshot, the different parts have correctly been identified as "param", "if statement" and so forth. You can also see how I can click in the test syntax to see where in the parse tree that part appears.

When done, I just exported the resulting parser code in ANTLRWorks with "Generate > Generate Code", copied the code from the "output" folder into my Eclipse project, added the antlr-3.3 jar into the build path of it, and then ran the __Test__.java file that comes with the output.

I wanted to do a little more parsing in my test though, so I ended up with this little test code:

package net.bioclipse.uppmax.galaxytoolconfigparser;
import org.antlr.grammar.v3.*;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;
import org.antlr.runtime.tree.CommonTree;
import org.antlr.runtime.tree.DOTTreeGenerator;
import org.antlr.runtime.tree.Tree;
import org.antlr.runtime.tree.TreeAdaptor;
import org.antlr.stringtemplate.StringTemplate;
 
public class ParseTest {
    // Generated stuff from ANTLR, which I can use to recognize token types   
    public static final int EOF=-1;
    public static final int ELSE=4;
    public static final int ENDIF=5;
    public static final int WORD=6;
    public static final int IF=7;
    public static final int STRING=8;
    public static final int VARIABLE=9;
    public static final int EQTEST=10;
    public static final int COLON=11;
    public static final int DBLDASH=12;
    public static final int EQ=13;
    public static final int WS=14;
 
    public static void main(String[] args) throws RecognitionException {
        String testString = "    sam_to_bam.py" 
                + "      --input1=$source.input1\n"
                + "      --dbkey=${input1.metadata.dbkey}\n"
                + "      #if $source.index_source == \"history\":\n"
                + "        --ref_file=$source.ref_file\n" 
                + "      #else\n"
                + "        --ref_file=\"None\"\n" 
                + "      #end if\n"
                + "      --output1=$output1\n"
                + "      --index_dir=${GALAXY_DATA_INDEX_DIR}\n"; 
        CharStream charStream = new ANTLRStringStream(testString);
        GalaxyToolConfigLexer lexer = new GalaxyToolConfigLexer(charStream);
        TokenStream tokenStream = new CommonTokenStream(lexer);
        GalaxyToolConfigParser parser = new GalaxyToolConfigParser(tokenStream, null);
 
        System.out.println("Starting ...");
        // GalaxyToolConfigParser.command_return command = parser.command();
        CommonTree tree = (CommonTree)parser.command().getTree();
        System.out.println("Done executing command ...");
 
        // The grammar has no tree rewrite rules, so the tree is flat: one child per token
        for (int i = 0; i < tree.getChildCount(); i++) {
            Tree subTree = tree.getChild(i);
            System.out.println("Subtree text: " + subTree.getText() + ", (Token type: " + subTree.getType() + ")");
        }
 
        // Generate DOT Syntax tree
        //DOTTreeGenerator gen = new DOTTreeGenerator();
        //StringTemplate st = gen.toDOT(tree);
        //System.out.println("Tree: \n" + st);
 
        System.out.println("Done!");
    }
}

... generating this output:

Starting ...
Done executing command ...
Subtree text: sam_to_bam.py, (Token type: 6)
Subtree text: --, (Token type: 12)
Subtree text: input1, (Token type: 6)
Subtree text: =, (Token type: 13)
Subtree text: $source.input1, (Token type: 9)
Subtree text: --, (Token type: 12)
Subtree text: dbkey, (Token type: 6)
Subtree text: =, (Token type: 13)
Subtree text: ${input1.metadata.dbkey}, (Token type: 9)
Subtree text: #if, (Token type: 7)
Subtree text: $source.index_source, (Token type: 9)
Subtree text: ==, (Token type: 10)
Subtree text: "history", (Token type: 8)
Subtree text: :, (Token type: 11)
Subtree text: --, (Token type: 12)
Subtree text: ref_file, (Token type: 6)
Subtree text: =, (Token type: 13)
Subtree text: $source.ref_file, (Token type: 9)
Subtree text: #else, (Token type: 4)
Subtree text: --, (Token type: 12)
Subtree text: ref_file, (Token type: 6)
Subtree text: =, (Token type: 13)
Subtree text: "None", (Token type: 8)
Subtree text: #end if, (Token type: 5)
Subtree text: --, (Token type: 12)
Subtree text: output1, (Token type: 6)
Subtree text: =, (Token type: 13)
Subtree text: $output1, (Token type: 9)
Subtree text: --, (Token type: 12)
Subtree text: index_dir, (Token type: 6)
Subtree text: =, (Token type: 13)
Subtree text: ${GALAXY_DATA_INDEX_DIR}, (Token type: 9)
Done!

... seemingly I have the stuff I need, for doing some logic parsing now! :)
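As a sketch of what that logic parsing could look like, the Python snippet below walks a flat token list like the one above and keeps only the parameters from the branch of the #if / #else that applies. The resolve function, the token names as strings and the variable values (ref.fa, out.bam) are all made up for illustration. The sketch handles a single, non-nested conditional, which is all the grammar allows so far, assumes exactly one WORD per parameter, and drops the quotes around string values.

```python
def resolve(tokens, variables):
    """Return the command-line words that survive the #if / #else, given variable values."""

    def value(kind, text):
        if kind == "VARIABLE":
            return variables[text.strip("${}")]  # "$a.b" and "${a.b}" both become "a.b"
        return text.strip('"')                   # STRING: drop the surrounding quotes

    words = []
    active = True          # are we currently emitting parameters?
    branch_taken = False   # did the #if test succeed?
    i = 0
    while i < len(tokens):
        kind, text = tokens[i]
        if kind == "IF":           # IF operand EQTEST operand COLON
            branch_taken = value(*tokens[i + 1]) == value(*tokens[i + 3])
            active = branch_taken
            i += 5
        elif kind == "ELSE":
            active = not branch_taken
            i += 1
        elif kind == "ENDIF":
            active = True
            i += 1
        elif kind == "DBLDASH":    # DBLDASH WORD EQ operand
            if active:
                words.append(f"--{tokens[i + 1][1]}={value(*tokens[i + 3])}")
            i += 4
        else:                      # the binary name
            words.append(text)
            i += 1
    return words


TOKENS = [
    ("WORD", "sam_to_bam.py"),
    ("IF", "#if"), ("VARIABLE", "$source.index_source"), ("EQTEST", "=="),
    ("STRING", '"history"'), ("COLON", ":"),
    ("DBLDASH", "--"), ("WORD", "ref_file"), ("EQ", "="), ("VARIABLE", "$source.ref_file"),
    ("ELSE", "#else"),
    ("DBLDASH", "--"), ("WORD", "ref_file"), ("EQ", "="), ("STRING", '"None"'),
    ("ENDIF", "#end if"),
    ("DBLDASH", "--"), ("WORD", "output1"), ("EQ", "="), ("VARIABLE", "$output1"),
]

# Hypothetical values, as a wizard might collect them from the user.
print(resolve(TOKENS, {"source.index_source": "cached", "output1": "out.bam"}))
```

Note that the values of the branch that is not taken are never looked up, so a wizard would only need to ask for the parameters that actually end up on the command line.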

Some words about BNF

ANTLR is an (E)BNF parser generator. I had heard a little about BNF before, and was more or less scared off from the topic, thinking it looked too advanced, but really, I found it isn't that hard at all!

It strikes me that BNF is pretty much RegEx with named rules ("functions") added. Since rules can refer to each other, and to themselves, you get recursive pattern matching, which you'll need for anything more advanced, such as nested braces, XML tags etc. ... but as you can see in the example above, much of the pattern matching syntax actually has big similarities to RegEx.
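To show what the recursion buys you, here is a tiny hand-written Python sketch of a single self-referencing rule for nested braces, roughly what a parser generator would produce for group : '{' (group | CHAR)* '}'. The rule and the function names are made up for illustration and have nothing to do with the Galaxy grammar. Classic regular expressions cannot match nesting of arbitrary depth, while a rule that calls itself handles it naturally.

```python
def parse_group(text, pos=0):
    """Parse one '{...}' group starting at text[pos]; return (tree, next position).

    Grammar, EBNF style:  group : '{' (group | CHAR)* '}' ;
    """
    if pos >= len(text) or text[pos] != "{":
        raise ValueError(f"expected '{{' at offset {pos}")
    pos += 1
    children = []
    while pos < len(text) and text[pos] != "}":
        if text[pos] == "{":
            child, pos = parse_group(text, pos)  # the rule refers to itself
            children.append(child)
        else:
            children.append(text[pos])
            pos += 1
    if pos >= len(text):
        raise ValueError("missing closing '}'")
    return children, pos + 1


def parse(text):
    """Parse a complete string that must consist of exactly one group."""
    tree, end = parse_group(text)
    if end != len(text):
        raise ValueError(f"trailing input at offset {end}")
    return tree


print(parse("{a{b{c}}d}"))
```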

In terms of tutorials, for the (E)BNF/ANTLR combo at least, I'd highly recommend this set of screencasts on using ANTLR in Eclipse. Though I didn't use the Eclipse version, these screencasts quickly give you an idea of how it all works ... I watched at least a bunch of them, and I'm happy I did.