Making Deniz a single-file app

Following up on the previous post: Deniz, the RDF browser written in HTML, Javascript & CSS, can now be distributed as a single file.

This is possible due to the embedding script introduced in the previous post (found below).

The last missing step was the image embedding part, which is nicely solved by https://github.com/nzakas/cssembed. In addition, Deniz now goes through the Google Closure Compiler for Javascript and Yahoo’s YUI Compressor for CSS to save bandwidth.

Thanks to the Makefile by Benjamin Lupton (https://github.com/balupton/jquery-sparkle/blob/master/Makefile) it was easy to set the process up for Deniz.

Two steps will build the file:

$ make build-update

to download JAR dependencies, and

$ make

to finally minify and merge all contents into the single file.

That’s it.


Embedding external CSS & Javascript into the base HTML document

So I’m stuck on the train for some hours, why not solve a problem that is far from pressing?

I am developing a web application based only on HTML, CSS & Javascript, called Deniz (http://cburgmer.github.com/deniz/). It’s a browser for RDF data and needs nothing but a web browser to run in, as it connects directly to public data endpoints. While it is built up from many different sources, it would be nice if the whole application could be delivered in a single file. This could speed up loading, but the main idea here is to distribute just one HTML file.

Looking around, there are many services and libraries for compressing and aggregating CSS & JS files, but so far I haven’t found a solution for exactly what I am trying to achieve.

I’ve now come up with an implementation which parses the DOM tree and looks for elements referencing stylesheets and <script> tags referencing external Javascript code. The program reads in the contents of the referenced files and pastes them into the document. This is harder than it initially seems: XHTML, which I assume here, needs embedded code to be wrapped in a CDATA section. I had to fight with the Python lxml library for some time to get this straight:

  1. The parser needs to be passed “strip_cdata=False” so that CDATA blocks read from the input are preserved.
  2. Code needs to be wrapped in an instance of the CDATA class.
  3. A dirty hack quotes the encapsulating CDATA markers in multi-line comments to accommodate older browsers:

        html.replace('<![CDATA[', '/*<![CDATA[*/').replace(']]>', '/*]]>*/')

  4. While a proper solution would need to parse the CSS & Javascript code to escape character sequences invalid in XHTML, another dirty hack makes sure that the text ‘</script>’ in Javascript strings gets quoted:

        content = (content.replace('</script>"', '</scr" + "ipt>"')
                          .replace("</script>'", "</scr' + 'ipt>'"))

Warning: this script is not suited to parse arbitrary JS & CSS. It does, however, work for my task.

The source can be found here: http://github.com/cburgmer/deniz/blob/master/embed_media.py
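
For illustration, here is a condensed sketch of the approach for <script> tags (stylesheets are handled analogously); this is my own simplification, not the exact code from the linked script:

from lxml import etree

XHTML_NS = '{http://www.w3.org/1999/xhtml}'

def embed_scripts(path):
    # 1. strip_cdata=False keeps existing CDATA sections intact on parsing
    parser = etree.XMLParser(strip_cdata=False)
    tree = etree.parse(path, parser)

    for script in tree.iter(XHTML_NS + 'script'):
        src = script.get('src')
        if not src:
            continue
        code = open(src).read()  # assumes a relative local path
        # 4. crude quoting of '</script>' inside Javascript strings
        code = (code.replace('</script>"', '</scr" + "ipt>"')
                    .replace("</script>'", "</scr' + 'ipt>'"))
        del script.attrib['src']
        # 2. wrap the embedded code in a CDATA instance
        script.text = etree.CDATA(code)

    html = etree.tostring(tree)
    # 3. hide the CDATA markers from older browsers
    return html.replace('<![CDATA[', '/*<![CDATA[*/').replace(']]>', '/*]]>*/')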

The next step will be to include images as base64 data URLs.

A side-by-side diff view

Just a short post about a side-by-side diff algorithm I implemented on top of Google’s diff-match-patch library. It is modelled after Wikipedia’s diff view.

The snippet is posted under http://code.activestate.com/recipes/577784-line-based-side-by-side-diff/

An example rendering generated by this algorithm can be seen at http://jsfiddle.net/hRS9N/1/

Python has a side-by-side diff view implementation in difflib, but it sure isn’t up to current web standards (tables everywhere…) and not adaptable at all. After all, why would you limit the whole implementation to only returning a blob of HTML when the user knows best how to render the diff’s outcome anyway? Also, diff-match-patch probably performs way better. As described at http://code.google.com/p/google-diff-match-patch/wiki/LineOrWordDiffs there is a quicker way to solve this problem with a similar outcome; however, changes are then calculated and shown in a pretty coarse way. I’ll post about an application of this diff later on.
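
For reference, the line-mode trick from that wiki page looks roughly like this (a sketch, assuming the diff_match_patch Python module is installed):

from diff_match_patch import diff_match_patch

def line_diff(text1, text2):
    dmp = diff_match_patch()
    # Map each unique line to a single character so the diff runs line-wise
    chars1, chars2, line_array = dmp.diff_linesToChars(text1, text2)
    diffs = dmp.diff_main(chars1, chars2, False)
    # Translate the character-based diff back into lines (modifies diffs in place)
    dmp.diff_charsToLines(diffs, line_array)
    return diffs  # list of (operation, text) tuples

print line_diff("a\nb\nc\n", "a\nx\nc\n")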

Using proper timezone information with SuRF

A short note on how best to deal with date, time and datetime objects in SuRF.

Chances are that you are living in a timezone other than UTC (an imaginary timezone) and want to handle time properly. Coming from SQL systems, people might not be used to having the database store additional timezone information (at least MySQL doesn’t). RDF stores like AllegroGraph and Virtuoso however follow the XML Schema and, more precisely, the ISO 8601 standard when storing date & time objects, and make your life easier.

At least if you follow this short suggestion here.

As I tried to document in http://code.google.com/p/surfrdf/wiki/BackendPeculiarities, Virtuoso and AllegroGraph handle datetime objects differently. When presented with a timezone-less date, Virtuoso assumes the server’s timezone, while AllegroGraph assumes UTC (“Z”). You are probably using Python, so you have to deal with this yourself, as Python’s datetime objects carry no timezone out of the box. RDFLib also ignores timezones for now, which will hopefully change once http://code.google.com/p/rdflib/issues/detail?id=169 is implemented.

If you want to make sure the correct timezone is stored, look at the example below. The code uses pytz to get the UTC timezone and stores datetime.now() as UTC. This makes both Virtuoso & AllegroGraph store the same date. If you don’t intend to store values as UTC, look into pytz anyway; it brilliantly provides the timezone support Python is missing.

https://gist.github.com/1002887
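
The essential part is making the datetime object timezone-aware before handing it to SuRF; a minimal sketch of the idea (the gist shows the full example):

from datetime import datetime
import pytz

# Naive: no offset; Virtuoso and AllegroGraph will interpret this differently
naive = datetime.now()

# Aware: explicit UTC offset; both stores will save the very same instant
aware = datetime.now(pytz.utc)
print aware.isoformat()  # e.g. '2011-06-01T12:00:00.000000+00:00'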

A small fact on the side: AllegroGraph normalizes all timestamps to UTC, so the original offset gets lost. This gave me some headaches; it might give you some too.

Python private attribute annoyance

Having some Java history I do like the concept of protected and private attributes for hiding implementation details. I also like Python’s forgiving take on this: it doesn’t do any access checking and still allows access to private attributes, e.g. for debugging purposes.

This attribute here

class A(object):
    def __init__(self):
        self.__a = 1

can be accessed like this

a = A()
print a._A__a

The concept of prepending the class’s name to the attribute’s name is called “name mangling”. It is an easy way of hiding the private value from the interface.

However, I just tripped over a small issue with name mangling. Consider the following example, which is a common pattern when lazily computing resource-hungry values:

class A(object):
    def a(self):
        if not hasattr(self, '__a'):
            print "generating a"
            self.__a = 1
        return self.__a

Now let’s run the method:

>>> a = A()
>>> a.a()
generating a
1
>>> a.a()
generating a
1

Obviously hasattr doesn’t check for private attributes as expected, and the value gets recalculated over and over again. The reason: inside the class body self.__a is mangled to self._A__a, but the plain string '__a' passed to hasattr is not. What I should actually do is check for '_A__a', which is kind of counterintuitive here. See also http://bugs.python.org/issue8264.
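
For completeness, a version that works, checking the mangled name explicitly:

class A(object):
    def a(self):
        # hasattr receives the raw string, so spell out the mangled name
        if not hasattr(self, '_A__a'):
            print "generating a"
            self.__a = 1  # the compiler mangles this to self._A__a
        return self.__a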

Now that was annoying.

Comparing Django ORM, SQLAlchemy & SuRF

The query interface is, I believe, the part of your favourite ORM you see the most. So here’s an overview of how three ORMs for Python offer querying: Django ORM, SQLAlchemy and SuRF. The former two are well known for SQL; the latter is a relatively new interface to RDF data (queried foremost via SPARQL).

My goal is to see what methods are offered to query data and to compare them to each other. I’ll be coming from the Django side, comparing equivalent methods in SQLAlchemy and also in SuRF. Don’t get me wrong, SQLAlchemy users probably come at this from a different angle, but I don’t use it often enough to know the full ORM details. For my use case this suffices so far.

The real goal here is to see what needs to be done, and what yet can be done, for SuRF. I have already started developing Django-style complex Q queries and slicing, and want to see how SQLAlchemy does these things, and also what else I need to take care of.

This list is far from complete, nor, as said, does it present an unbiased view. I might extend this list in the future. Feel free to note errors and other important differences.

Sources

On SQLAlchemy

I personally don’t have much experience with the SQLAlchemy ORM, so take the examples for SQLAlchemy with a grain of salt. I’ll use the query_property here (see http://www.sqlalchemy.org/docs/05/reference/orm/sessions.html) so that it compares more easily to Django. However I don’t know if this way is generally accepted in the SQLAlchemy world.

Here is how the SQLAlchemy tutorial puts it:

session.query(User).all()

Using the query_property it boils down to:

# db_session is assumed to be a scoped_session(sessionmaker(...)) instance
class MyClass(object):
    query = db_session.query_property()

MyClass.query.all()

On SuRF

SuRF, being an object relational mapper for RDF data, does many things differently. Most importantly, a property always holds a list of values; it might be empty, contain one value or several. Also, some functionality like aggregates was only specified in SPARQL 1.1 and has yet to be implemented by many backend stores. SuRF knows namespaces, and thus a property is referenced via its namespace, here “myns” for property “prop”.
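
To make the SuRF column below concrete, here is a hypothetical setup for the examples (the endpoint URL and the namespace URI are made up):

import surf

surf.ns.register(myns='http://example.com/myns#')

store = surf.Store(reader='sparql_protocol',
                   writer='sparql_protocol',
                   endpoint='http://localhost:8890/sparql')
session = surf.Session(store)

MyClass = session.get_class(surf.ns.MYNS.MyClass)

for instance in MyClass.get_by(myns_prop=10):
    print instance.myns_prop  # a property always holds a list of values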

The comparison

| Django ORM | SQLAlchemy | SuRF | Description |
|------------|------------|------|-------------|
| MyClass.objects.all() | MyClass.query.all() | MyClass.all() | All elements |
| MyClass.objects.filter(prop=10) | MyClass.query.filter(MyClass.prop == 10) or MyClass.query.filter_by(prop=10) | MyClass.get_by(myns_prop=10) | Query by parameter |
| MyClass.objects.get(pk=10) | MyClass.query.get(10) | MyClass.get_by(myns_pk=10).one() | Unique key |
| MyClass.objects.get(prop=10) | MyClass.query.filter(MyClass.prop == 10).one() | MyClass.get_by(myns_prop=10).one() | One exact result |
| ? | MyClass.query.filter(MyClass.prop == 10).first() | MyClass.get_by(myns_prop=10).first() | First result |
| MyClass.objects.filter(prop__gt=10) | MyClass.query.filter(MyClass.prop > 10).all() | MyClass.all().filter(myns_prop="(%s > 10)") | Greater-than filtering |
| MyClass.objects.filter(prop__in=[1, 2]) | MyClass.query.filter(MyClass.prop.in_([1, 2])) | MyClass.get_by(myns_prop=[1, 2]) | In list |
| MyClass.objects.filter(prop__startswith='somethin') | MyClass.query.filter(MyClass.prop.like('somethin%')) | MyClass.all().filter(myns_prop='regex(%s, "^somethin", "i")') | Substring |
| MyClass.objects.exclude(prop=10) | MyClass.query.filter(~(MyClass.prop == 10)) |  | Negative search |
| MyClass.objects.all().count() | MyClass.query.count() | len(MyClass.all()) | Result count |
| MyClass.objects.all().delete() | MyClass.query.delete() |  | Batch removal |
| MyClass.objects.all().exists() | ? |  | Boolean exists |
| MyClass.objects.all().order_by("-prop") | MyClass.query.order_by(desc(MyClass.prop)) | MyClass.all().order(ns.MYNS.prop).desc() | Descending ordering |
| default | default | MyClass.all().full() | Preload properties |
| MyClass.objects.all().select_related('prop') | session.query(MyClass).options(joinedload('prop')).all() |  | Eagerly load relations |
| MyClass.objects.aggregate(Avg('prop')) | session.query(func.avg(MyClass.prop)).all() |  | Aggregates |
| MyClass.objects.all()[1:10] | MyClass.query[1:10] | MyClass.all().offset(1).limit(10) | Slicing |
| MyClass.objects.all().only("prop") | session.query(MyClass.prop).all() |  | Load only given properties (performance) |
| MyClass.prop.remove(i) | MyClass.prop.remove(i) | MyClass.myns_prop.remove(i) | Remove from one-to-many relationship |
| MyClass.objects.filter(Q(prop='x') \| Q(prop='y')) | MyClass.query.filter(or_(MyClass.prop == 'x', MyClass.prop == 'y')) |  | Complex expression |

Using Selenium and lettuce to validate XHTML markup

I am currently getting started on providing unit tests for Deniz with Selenium. While Deniz is pure Javascript (with HTML & CSS), I am using Python with Lettuce (a clone of Ruby’s Cucumber) to test the application. Lettuce is a behaviour driven development (BDD) tool and makes testing clean and fun.

In this fashion I wanted to check that the W3C XHTML button is placed rightfully, i.e. that Deniz actually is valid XHTML. So automated testing comes into play.

Here’s a small recipe for formulating the validation test. There were some quirks when using Firefox that needed a workaround, so I thought I’d share:

https://gist.github.com/924043
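
The step in the gist boils down to something like the following sketch (names are illustrative; world.browser is assumed to hold the Selenium driver, and the Firefox workarounds are left out):

import urllib
from lettuce import step, world

VALIDATOR = 'http://validator.w3.org/check'

@step(u'the page should be valid XHTML')
def page_should_be_valid_xhtml(step):
    markup = world.browser.page_source.encode('utf-8')
    # POST the markup to the W3C validator and inspect its status header
    response = urllib.urlopen(VALIDATOR, urllib.urlencode({'fragment': markup}))
    status = response.info().getheader('X-W3C-Validator-Status')
    assert status == 'Valid', 'W3C validator says: %s' % status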

So I found out that while the “static” deniz.html is valid XHTML, the components that get embedded once Javascript runs break the document’s validity.

Now that brings me to a new question: should AJAX-style webpages strictly adhere to W3C standards while operating, i.e. while going through the various states of the application, each changing the underlying HTML code? I guess so.

jquery-shiftenter

Wanting to quickly post an HTML form consisting of a textarea, I found nothing on the web that would let me simply press Enter to submit the contents. Similar to how Facebook got rid of their comment button (you simply press Enter when finished commenting), I want the Return key to submit the form while retaining the possibility to create line breaks by hitting Shift+Enter.

I thus came up with jquery-shiftenter, a simple jQuery plug-in that turns your textarea into an input that submits on Enter and shows a textual hint on how to generate newlines.

You can find an example here. The code is on github. Enjoy.

Deniz, a simple Javascript-based RDF browser

I’d like to announce a little project started recently called “Deniz”, a browser to view your RDF data.

While developing our software Trip, which is based on an RDF “triple” store, we need to look at our data on a daily basis. Most often this is just for debugging purposes, and sadly we are limited to what the triple stores offer. In the case of Franz AllegroGraph the store already ships with an AJAX-based browser, but Virtuoso, for example, only offers a plain SPARQL interface.

Goals

The goal of Deniz is to implement a simple and lightweight browsing application for querying RDF stores using SPARQL. More specifically, it builds on top of stores implementing the SPARQL protocol defined by the W3C, which makes the application dead simple: a SPARQL query string sent to such a server returns a JSON structure of results that can easily be turned into a human-readable table.
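
To illustrate what the protocol boils down to, here is a minimal sketch in Python (Deniz itself does the same via jQuery’s AJAX calls; the format parameter is a Virtuoso/DBpedia convenience):

import json
import urllib

endpoint = 'http://dbpedia.org/sparql'
query = 'SELECT * WHERE { ?s ?p ?o } LIMIT 10'

# The SPARQL protocol: the query goes into a single request parameter
params = urllib.urlencode({'query': query,
                           'format': 'application/sparql-results+json'})
results = json.load(urllib.urlopen(endpoint + '?' + params))

# Each binding row maps variable names to typed values
for row in results['results']['bindings']:
    print row['s']['value'], row['p']['value'], row['o']['value']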

Inspired by AllegroGraph’s browser it will probably inherit many of its ideas, but so far I want to keep the following points in mind when improving Deniz:

  • Easy to use
    • Deniz is not designed to become a phpMyAdmin for RDF, nor anywhere near it.
  • Transparent SPARQL usage
    • You can quickly learn SPARQL by looking at the SPARQL code the different views use. The expert in turn will quickly see what the views do and do not offer.
  • Practical usage
    • We use RDF to solve problems. Deniz should help us do that rather than offer a complete set of operations on the triple store.

SPARUL (SPARQL/Update) might go into the interface in the future, but I can’t say for sure.

CORS

One technique needed to access an RDF store via AJAX is CORS (“Cross-Origin Resource Sharing”). CORS offers a standardised way around the same-origin limitation imposed on Javascript connections out of security concerns: the browser disallows any request to a server on a domain different from the one the originating page is served from. Via CORS a server can explicitly allow cross-domain requests, and that is what we use here.
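
A quick way to see whether an endpoint plays along is to look for the CORS response header; a minimal check in Python (the ASK query is just a cheap no-op):

import urllib

# CORS-enabled endpoints answer with e.g. "Access-Control-Allow-Origin: *"
params = urllib.urlencode({'query': 'ASK {}'})
response = urllib.urlopen('http://dbpedia.org/sparql?' + params)
print response.info().getheader('Access-Control-Allow-Origin')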

In theory this means that Deniz works completely without any server-side deployment. You only need a SPARQL endpoint that supports CORS, which is the case for dbpedia.org, for example; Deniz already queries DBpedia by default. If the store you want to query lacks such support, read below for a simple solution.

Implementation

Deniz is implemented in Javascript, HTML & CSS, so far my first project with this stack. You can run it from your local hard drive or deploy it on a standard non-CGI/PHP web server. It uses jQuery, jQuery UI (both in a pretty basic way) and CodeMirror as syntax highlighter (can you believe it ships with a SPARQL highlighter by default?). A nice addition is the jQuery history plug-in, which offers back & forward browsing as if Deniz were a full-blown web application.

Currently the monolithic deniz.html could do with some refactoring. This will probably happen once my initial feature set is implemented. One point still missing from my list, for example, is easy GRAPH support.

SPARQL protocol proxy

If you either don’t have a SPARQL protocol compatible store or need the CORS support described above, then the SPARQL protocol proxy might work for you. It was explicitly started to offer the missing layer to Deniz and is implemented as a small HTTP server written in pure Python. It is far from supporting anything near 100% of the SPARQL protocol and until now has only been tested with Virtuoso, but it might suit your needs.

Demo & License

You can find the demo here. By default dbpedia.org is selected as the endpoint, but you can change it to your own triple store (see above though for CORS support).

Deniz is released under a new BSD license, so you are pretty free to do whatever you like with it.

And before I forget, “deniz” is Turkish for “sea”. Now you’ve also improved your language skills while reading about the semantic web, isn’t that nice?

Update: A new Virtuoso version with CORS support was released some days ago. I am still looking into how to enable it, and eventually I’ll find out how.

Update2: See http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtTipsAndTricksGuide… on how to configure Virtuoso.