Data-Intensive Computing – LCS & GBML Central

I just uploaded the technical report of the paper we put together for CEC 2010 on how we can scale up eCGA using a MapReduce approach. The paper, besides exploring the Hadoop implementation, it also presents some very compelling results obtained with MongoDB (a document based store able to perform parallel MapReduce tasks via sharding). […]

Abstract:
This paper shows how the extended compact genetic algorithm can be scaled using data-intensive computing techniques such as MapReduce. Two different frameworks (Hadoop and MongoDB) are used to deploy MapReduce implementations of the compact and extended com- pact genetic algorithms. Results show that both are good choices to deal with large-scale problems as they can scale with the number of commodity machines, as opposed to previous ef- forts with other techniques that either required specialized high-performance hardware or shared memory environments.

Soaring the Clouds with Meandre

You may find the slide deck and the abstract for the presentation we delivered today at the “Data-Intensive Research: how should we improve our ability to use data” workshop in Edinburgh. Abstract This talk will focus a highly scalable data intensive infrastructure being developed at the National Center for Supercomputing Application (NCSA) at the University […]

You may find the slide deck and the abstract for the presentation we delivered today at the “Data-Intensive Research: how should we improve our ability to use data” workshop in Edinburgh.

Abstract

This talk will focus a highly scalable data intensive infrastructure being developed at the National Center for Supercomputing Application (NCSA) at the University of Illinois and will introduce current research efforts to tackle the challenges presented by big-data. Research efforts include exploring potential ways of integration between cloud computing concepts—such as Hadoop or Meandre—and traditional HPC technologies and assets. These architecture models contrast significantly, but can be leveraged by building cloud conduits that connect these resources to provide even greater flexibility and scalability on demand. Orchestrating the physical computational environment requires innovative and sophisticated software infrastructure that can transparently take advantage of the functional features and to negotiate the constraints imposed by this diversity of computational resources. Research conducted during the development of the Meandre infrastructure has lead to the production of an agile conductor able to leverage the particular advantages in the physical diversity. It can also be implemented as services and/or in the context of another application benefitting from it reusability, flexibility, and high-scalability. Some example applications and an introduction to the data intensive infrastructure architecture will be presented to provide an overview of the diverse scope of Meandre usages. Finally, a case will be presented showing how software developers and system designers can easily transition to these new paradigms to address the primary data-deluge challenges and to soar to new heights with extreme application scalability using cloud computing concepts.

Scaling Genetic Algorithms using MapReduce

Below you may find the abstract to and the link to the technical report of the paper entitled “Scaling Genetic Algorithms using MapReduce” that will be presented at the Ninth International Conference on Intelligent Systems Design and Applications (ISDA) 2009 by Verma, A., Llorà, X., Campbell, R.H., Goldberg, D.E. next month. Abstract:Genetic algorithms(GAs) are increasingly […]

Below you may find the abstract to and the link to the technical report of the paper entitled “Scaling Genetic Algorithms using MapReduce” that will be presented at the Ninth International Conference on Intelligent Systems Design and Applications (ISDA) 2009 by Verma, A., Llorà, X., Campbell, R.H., Goldberg, D.E. next month.

Abstract:Genetic algorithms(GAs) are increasingly being applied to large scale problems. The traditional MPI-based parallel GAs do not scale very well. MapReduce is a powerful abstraction developed by Google for making scalable and fault tolerant applications. In this paper, we mould genetic algorithms into the the MapReduce model. We describe the algorithm design and implementation of GAs on Hadoop, the open source implementation of MapReduce. Our experiments demonstrate the convergence and scalability upto 105 variable problems. Adding more resources would enable us to solve even larger problems without any changes in the algorithms and implementation.

The draft of the paper can be downloaded as IlliGAL TR. No. 2009007. For more information see the IlliGAL technical reports web site.

Temporary storage for Meandre’s distributed flow execution

Designing the distributed execution of a generic Meandre flow involves several moving pieces. One of those is the temporary storage required by the computing nodes (think of it as one node as one isolated component of a flow) to keep up with the data generated by a component, and also be able to replicate such […]

Transaction ready
Light weight implementation
Efficient write and read to minimize the contention on ports

Also, it is important to keep in mind that in a distributed execution scenario, each node requires to have its one separated and standalone storage system. Thus, it is also important to minimize the overhead of installation and maintenance of such storage subsystem. There are several alternatives available ranging from traditional relational data base systems to home-brewed solutions. Relational data base systems provide a distributed, reliable, stable, and well tested environment, but they may tend to require a quite involved installation and maintenance. Also, tuning those systems to optimize performance may required quite an involved monitoring and tweaking. On the other hand, home-brewed solutions can be optimized for performance by dropping non required functionality and focussing on writing and reading performance. However, such solutions tend to be bug prone and tend to become time consuming, not to mention that proving transaction correctness can be quite involved.

Fortunately there is a middle ground where efficient and stable transaction aware solutions are available. They may not provide SQL interfaces, but they still provide transaction boundaries. Also, since they are oriented to maximize performance, they can provide better throughput and operation latency than having to traverse the SQL stack. Examples of such storage systems can be found under the areas of key-value stores and column stores. Several options were considered while writing these line, but key-value stores were the ones that better matches the three requirements described above. Several options were informally tested, including solutions like HDF and Berkely DB, however the best performing by far under similar stress test conditions as the sketched temporary storage subsystem was Tokyo Cabinet. I already introduced and tested Tokyo Cabinet more than a year ago, but this time I was going to give it a stress test to basically convince myself that that was what I wanted to use for as temporary storage of the distributed flow execution.

The experiment

Tokyo cabinet is a collection of storage utilities including, among other facilities, key-value stores implemented as hash files or B-trees and flexible column stores. To illustrate the performance and throughput you can achieve. To implement multiple queues on a single casket (Tokyo Cabinet file containing the data store) B-trees with duplicated keys can help achieving such goal. The duplicated keys are the queue names, and the values are the UUIDs of the objects being store. Objects are also stored in the same B-tree by using the UIUD as a key and the value become the payload to store (usually an array of bytes).

Previously, I have been heavily using Python bindings to test Tokyo Cabinet, but this time I went down the Java route (since the Meandre infrastructure is written on Java). The Java bindings are basically build around JNI and statically link to the C version of Tokyo Cabinet library, giving away the best of both world. To measure how fast can I write data out of a port into the local storage in a transactional mode, I used the following piece of code.

	public static void main ( String args [] ) {
		int MAX = 10000000;
		int inc = 10;
		int cnt = 0;
		float fa [] = new float[8];
		int reps = 10;
 
		for ( int i=1 ; i<=MAX ; i*=inc  ) {
			//System.out.println("Size: "+i);
			for ( int j=0 ; j<reps ; j++ ) {	
				//System.out.println("\tRepetition: "+j);
 
				// open the database
				BDB bdb = new BDB();
 
				if(!bdb.open(TEST_CASKET_TCB, BDB.OWRITER | BDB.OCREAT | BDB.OTSYNC )){
					int ecode = bdb.ecode();
					fail("open error: " + bdb.errmsg(ecode));
				}
 
				// Add a bunch of duplicates
				long start = System.currentTimeMillis();
				bdb.tranbegin();
				for ( int k=0; k<i; k++ ) {
					String uuid = UUID.randomUUID().toString();
					bdb.putdup(QUEUE_KEY, uuid);
					bdb.putdup(uuid.getBytes(), uuid.getBytes());	
				}
				bdb.trancommit();
				fa[cnt] += System.currentTimeMillis()-start;
 
				// Clean up
				bdb.close();
				new File(TEST_CASKET_TCB).delete();
			}
			fa[cnt] /= reps;
			System.out.println(""+i+"\t"+fa[cnt]+"\t"+(fa[cnt]/i));
			cnt++;
		}
	}

The idea is very simple. Just go and star storing 1, 10, 100, 1000, 10000, 1000000, and 10000000 pieces of data at once in a transaction. Measure the time. For each data number repeat the operation 10 times and average the time trying to palliate the fact that the experiment was run on a laptop running all sorts of other concurrent applications. Plot the results to illustrate:

time required to insert one piece of data as a function of the number of data involve in the transaction
number of pieces of data wrote per second as a function of the number of data involve in the transaction

The idea is to expose the behavior of Tokyo Cabinet as more data is involved in a transaction to check if degradation happens as the volume increase. This is an important issue, since data intensive flows can generate large volumes of data per firing event.

The results

Results are displayed on the figures below.

The first important element to highlight is that the time to insert one data element does not degrade as the volume increase. Actually, it is quite interesting that Tokyo Cabinet feels more comfortable as the volume per transaction grows. The throughput results are also interesting, since it shows that it is able to sustain transfers of around 40K data units per second, and that the only bottleneck is the disk cache management and bandwidth to the disk itself—which gets saturated after pushing more than 10K pieces of data.

The lessons learned

Tokyo Cabinet is a excellent candidate to support the temporary transactional storage required in a distributed execution of a Meandre flow. Other alternatives like MySQL, embedded Apache Derby, the Java edition of Berkeley DB, SQLite JDBC could not get even get close to such performance falling at least one order of magnitude behind.

Liquid: RDF endpoint for FluidDB

A while ago I wrote some thoughts about how to map RDF to and from FluidDB. There I explored how you could map RDF onto FluidDB, and how to get it back. That got me thinking about how to get a simple endpoint you could query for RDF. Imagine that you could pull FluidDB data in RDF, then I could just get all the flexibility of SPARQL for free. With this idea in my mind I just went and grabbed Meandre, the JFLuidDB library started by Ross Jones, and build a few components.

The main goal was to be able to get an object, list of the tags, and express the result in RDF. FluidDB helps the mapping since objects are uniquely identified by URIs. For instance, the unique object 5ff74371-455b-4299-83f9-ba13ae898ad1 (FluidDB relies on UUID version four with the form xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx) is uniquely identified by http://sandbox.fluidinfo.com/objects/5ff74371-455b-4299-83f9-ba13ae898ad1 (or a url of the form http://sandbox.fluidinfo.com/objects/xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx), in case you are using the sandbox or http://fluiddb.fluidinfo.com/objects/5ff74371-455b-4299-83f9-ba13ae898ad1 if you are using the main instance. Same story for tags. The tag fluiddb/about can be uniquely identified by the URI http://sandbox.fluidinfo.com/tags/fluiddb/about, or http://fluiddb.fluidinfo.com/tags/fluiddb/about.

A simple RDF description for and object

Once you get the object back the basic translated RDF version for object a10ab0f3-ef56-4fc0-a8fa-4d452d8ab1db should look like as the listing below in TURTLE notation.

<http://sandbox.fluidinfo.com/objects/a10ab0f3-ef56-4fc0-a8fa-4d452d8ab1db>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute fluiddb/default/tags/permission/update/policy"^^<http://www.w3.org/2001/XMLSchema#string> .

I will break the above example into small chunks and explain the above example into the three main pieces involved (the id, the about, and the tags). The basic construct is simple. First a triple to mark the object as a FluidDB object.

<http://sandbox.fluidinfo.com/objects/a10ab0f3-ef56-4fc0-a8fa-4d452d8ab1db>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/>   
.

Then if the object has an about associated on creation, another triple gets generated and added, as shown below. To be consistent, I suggest reusing DC description since that is what the about for an object tend to indicate.

<http://sandbox.fluidinfo.com/objects/a10ab0f3-ef56-4fc0-a8fa-4d452d8ab1db>
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute fluiddb/default/tags/permission/update/policy"^^<http://www.w3.org/2001/XMLSchema#string> 
.

Finally, if there are tags associated to the object, a bag gets created, and all the URI describing the tags get pushed into the bag as shown below.

<http://sandbox.fluidinfo.com/objects/a10ab0f3-ef56-4fc0-a8fa-4d452d8ab1db>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description>
.

Creating and RDF endpoint

Armed with the previous, the thing should be easy. Just allow querying for objects, then collect the object information, and finally generate the final RDF. Using Meandre and JFLuidDB I wrote a few components that allow the simple creation of such an endpoint as illustrated by the picture below.

The basic mechanism is simple. Just push the query into the Query for objects component. This component will stream each of the uuid of the matched objects to Read object which pulls the object information. Then the object is passed to Object to RDF model that basically generates the RDF snipped shown in the example shown above for each of the objects pushed. Finally all the RDF fragments are reduced together by component Wrapped models reducer. Then the resulting RDF model just gets serialize into text using the Turtle notation. Finally the serialized text is printed to the console. The equivalent code could be express as a ZigZag script as:

#
# Imports eliminated for clarity
#

#
# Create the component aliases
#
alias  as OBJECT_TO_RDF
alias  as PRINT_OBJECT
alias  as QUERY_FOR_OBJECTS
alias  as READS_THE_REQUESTED_OBJECT
alias  as WRAPPED_MODELS_REDUCER
alias  as MODEL_TO_RDF_TEXT
alias  as PUSH_STRING

#
# Create the component instances
#
push_query_string = PUSH_STRING()
wrapped_models_reducer = WRAPPED_MODELS_REDUCER()
query_for_objects = QUERY_FOR_OBJECTS()
reads_object = READS_THE_REQUESTED_OBJECT()
model_to_rdf_text = MODEL_TO_RDF_TEXT()
print_rdf_text = PRINT_OBJECT()
object_to_rdf_model = OBJECT_TO_RDF()

#
# Set component properties
#
push_query_string.message = "has fluiddb/tag/path"
query_for_objects.fluiddb_url = "http://sandbox.fluidinfo.com"
eads_object.fluiddb_url = "http://sandbox.fluidinfo.com"
model_to_rdf_text.rdf_dialect = "TTL"

#
# Create the flow by connecting the components
#
@query_for_objects_outputs = query_for_objects()
@model_to_rdf_text_outputs = model_to_rdf_text()
@push_query_string_outputs = push_query_string()
@object_to_rdf_model_outputs = object_to_rdf_model()
@reads_object_outputs = reads_object()
@wrapped_models_reducer_outputs = wrapped_models_reducer()

query_for_objects(text: push_query_string_outputs.text)
model_to_rdf_text(model: wrapped_models_reducer_outputs.model)
object_to_rdf_model(object: reads_object_outputs.object)
reads_object(uuid: query_for_objects_outputs.uuid)[+200!]
print_rdf_text(object: model_to_rdf_text_outputs.text)
wrapped_models_reducer(model: object_to_rdf_model_outputs.model)

The only interesting element in the script is the [+200!] entry that creates 200 parallel copies of read object that will concurrently hit FluidDB to pull the data, trying to minimize the latency. The script could be compiled into a MAU and run. The output of the execution would look like the following:

$ java -jar zzre-1.4.7.jar pull-test.mau 
Meandre MAU Executor [1.0.1vcli/1.4.7]
All rights reserved by DITA, NCSA, UofI (2007-2009)
THIS SOFTWARE IS PROVIDED UNDER University of Illinois/NCSA OPEN SOURCE LICENSE.
 
Executing MAU file pull-test.mau
Creating temp dir pull-test.mau.run
Creating temp dir pull-test.mau.public_resources
 
Preparing flow: meandre://seasr.org/zigzag/1253813636945/4416962494019783033/flow/pull-test-mau/
2009-09-24 12:34:38.480::INFO:  jetty-6.1.x
2009-09-24 12:34:38.495::INFO:  Started SocketConnector@0.0.0.0:1715
Preparation completed correctly
 
Execution started at: 2009-09-24T12:34:38
----------------------------------------------------------------------------
<http://sandbox.fluidinfo.com/objects/a24b4a18-5483-47c6-9b62-0955210c7ebd>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute test/Net::FluidDB-name-1253772095.82845-0.944567286499904"^^<http://www.w3.org/2001/XMLSchema#string> .
 
<http://sandbox.fluidinfo.com/objects/5ff74371-455b-4299-83f9-ba13ae898ad1>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute test/Net::FluidDB-name-1253622685.3231461-0.437099602163897316"^^<http://www.w3.org/2001/XMLSchema#string> .
 
<http://sandbox.fluidinfo.com/objects/67e52346-527e-4bb7-b8f3-05fa8a8ae35b>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute test/Net::FluidDB-name-1253620190.69175-0.861614257420541"^^<http://www.w3.org/2001/XMLSchema#string> .
 
<http://sandbox.fluidinfo.com/objects/8a65a184-03d9-4881-95df-02fa0561a86f>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute fluiddb/namespaces/permission/update/exceptions"^^<http://www.w3.org/2001/XMLSchema#string> .
 
<http://sandbox.fluidinfo.com/objects/335b44e9-a72f-479d-ad60-3661a35231ba>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute test/Net::FluidDB-name-1253776141.95577-0.284175700598524"^^<http://www.w3.org/2001/XMLSchema#string> .
 
<http://sandbox.fluidinfo.com/objects/3bbf1cc6-731c-4e56-a664-adeb5484334f>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute fluiddb/namespaces/permission/delete/policy"^^<http://www.w3.org/2001/XMLSchema#string> .
 
<http://sandbox.fluidinfo.com/objects/aba5adcf-fd44-40ab-b702-9cc635650bc3>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute test/Net::FluidDB-name-1253614713.757-0.604769721717702"^^<http://www.w3.org/2001/XMLSchema#string> .
 
<http://sandbox.fluidinfo.com/objects/f61ceb3b-33df-4356-8e7d-c56d3d0ae338>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute test/Net::FluidDB-name-1253615887.80879-0.0437609496034099"^^<http://www.w3.org/2001/XMLSchema#string> .
 
...

That’s it! A first RDF dump of the query!

The not so great news

The current FluidDB API does not provide any method to be able to pull data from more than one object at once. That basically means, that for each uuid a call to the server needs to be process. That is a huge latency overhead. The FluidDB guys know about it and they are scratching their heads on how to provide a “multi get”. A full trace of the output can be found on this FluidDB RDF endpoint trace.

This element is crucial for any RDF endpoint. Above I left out a basic element, the time measures. That part looks like:

Flow execution statistics

Flow unique execution ID : meandre://seasr.org/zigzag/1253813636945/4416962494019783033/flow/pull-test-mau/8D8E354A/1253813678323/1493255769/
Flow state               : ended
Started at               : Thu Sep 24 12:34:38 CDT 2009
Last update              : Thu Sep 24 12:37:28 CDT 2009
Total run time (ms)      : 170144

Basically 170s to pull only 238 objects, where all the time is spent round tripping to FluidDB.

Getting there

This basically means that such high latency would not allow efficient interactive usage of the end point. However, this exercise was useful to prof that simple RDF endpoints for FluidDB are possible and would greatly boost the flexibility of interaction with FluidDB . The current form of the endpoint is may still have value if you are not in a hurry, allowing you to run SPARQL queries against FluidDB data and get the best of both worlds.

The code use

If you are interested on running the code, you may need Meandre and the components I put together for the experiment, that you can get from http://github.com/xllora/liquid.

Liquid: RDF meandering in FluidDB

Meandre (NCSA pushed data-intensive computing infrastructure) relies on RDF to describe components, flows, locations and repositories. RDF has become the central piece that makes possible Meandre’s flexibility and reusability. However, one piece still remains largely sketchy and still has no clear optimal solution: How can we facilitate to anybody sharing, publishing and annotating flows, components, locations and repositories? More importantly, how can that be done in the cloud in an open-ended fashion and allow anybody to annotate and comment on each of the afore mentioned pieces?

The FluidDB trip

During my last summer trip to Europe, Terry Jones (CEO) invited me to visit FluidInfo (based in Barcelona) where I also meet Esteve Fernandez (CTO). I had a great opportunity to chat with the masterminds behind an intriguing concept I ran into after a short note I received from David E. Goldberg. FluidDB, the main product being pushed by FluidInfo, is an online collaborative “cloud” database. On FluidInfo words:

FluidDB lets data be social. It allows almost unlimited information personalization by individual users and applications, and also between them. This makes it simple to build a wide variety of applications that benefit from cooperation, and which are open to unanticipated future enhancements. Even more importantly, FluidDB facilitates and encourages the growth of applications that leave users in control of their own data.

FluidDB went live on a private alpha last week. The basic concept behind the scenes is simple. FluidDB stores objects. Objects do not belong to anybody. Objects may be “blank” or they may be about something (e.g. http://seasr.org/meandre). You can create as many blank objects as you want. Creating an object with the same about always returns the same object (thus, there will only be one object about http://seasr.org/meandre). Once objects exists, things start getting more interesting, you can go and tag any object with whatever tag you want. For instance I could tag the http://seasr.org/meandre object hosted_by tag, and assign the tag the value FluidDB introduces one last trick: namespaces. For instance, I got xllora. that means that the above tag I mentioned would look like /tag/xllora/hosted_by. You can create as many nested namespaces under your main namespace as you want. FluidDB also provides mechanisms to control who can query and see the values of your created tags.

As you can see, the basic object model and mechanics is very simple. When the alpha went live, FluidDB only provide access via a simple REST-like HTTP API. In a few days a blossom of client libraries that wrap that API were develop by a dynamic community that gather on #fluiddb channel on irc.freenode.net where FluidDB

You were saying something about RDF

Back to the point. One thing I chatted with the FluidDB guys was what did they think about the similarities between FluidDB’s object model and RDF. After playing with RDF for a while, the FluidDB model look awfully familiar, despite a much simplified and manageable model than RDF. They did not have much to say about it, and the question got stuck in the back of my mind. So when I got access to the private alpha, I could not help it but get down the path of what would it mean to map RDF on FluidDB. Yes, the simple straight answer would be to stick serialized RDF into the value of a given tag (e.g. xllora/rdf). However, that option seemed poor, since I could not exploit the social aspect of collaborative annotations provided by FluidDB. So back to the drawing board. What both models have in common: They are both descriptions about something. In RDF you can see those as the subjects of the triple predicates, whereas in FluidDB those are simple objects. RDF use properties to qualify objects. FluidDB uses tags. Both enable you to add value to qualified objects. Mmh, there you go.

With this idea in mind, I started Liquid, a simple proof-of-concept library that maps RDF on to FluidDB and then it gets it back. There was only one thing that needed a bit of patching. RDF properties are arbitrary URIs. Those could not be easily map on the top of FluidDB tags, so I took a simple compromise route.

RDFs subject URIs are mapped onto FluidDB qualified objects via the about tag
One FluidDB tag will contain all the properties for that object (basically a simple dictionary encoded in JSON)
Reference to other RDF URIs will be mapped on to FluidDB object URIs, and vice versa

Let’s make it a bit more chewable with a simple example.

<?xml version="1.0"?>
 
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:cd="http://www.recshop.fake/cd#">
 
<rdf:Description
rdf:about="http://www.recshop.fake/cd/Empire Burlesque">
  <cd:artist>Bob Dylan</cd:artist>
 </rdf:Description>
 
</rdf:RDF>

The above RDF represents a single triple

http://www.recshop.fake/cd/Empire Burlesque	http://www.recshop.fake/cd#artist	   "Bob Dylan"

This triple could be map onto FluidDB by creating one qualified FluidDB object and adding the proper tags. The example below shows how to do so using Python’s fdb.py client library by Nicholas J. Radcliffe.

import fdb,sys
if sys.version_info < (2, 6):
    import simplejson as json
else:
    import json
 
__RDF_TAG__ = 'rdf'
__RDF_TAG_PROPERTIES__  = 'rdf_properties'
__RDF_TAG_MODEL_NAME__ = 'rdf_model_name'
 
#
# Initialize the FluidDB client library
#
f = fdb.FluidDB()
#
# Create the tags (if they exist, this won't hurt)
#
f.create_abstract_tag(__RDF_TAG__)
f.create_abstract_tag(__RDF_TAG_PROPERTIES__)
f.create_abstract_tag(__RDF_TAG_MODEL_NAME__)
#
# Create the subject object of the triple
#	
o = f.create_object('http://www.recshop.fake/cd/Empire Burlesque')
#
# Map RDF properties
#
properties = {'http://www.recshop.fake/cd#artist':['Bob Dylan']}
#
# Tag the object as RDF aware, properties available, and to which model/named graph 
# it belongs
#
f.tag_object_by_id(o.id, __RDF_TAG__)
f.tag_object_by_id(o.id,__RDF_TAG_PROPERTIES__,value=json.dumps(properties))
f.tag_object_by_id(o.id, __RDF_TAG_MODEL_NAME__,'test_dummy')

Running along with this basic idea, I quickly stitched a simple library (Liquid) that allows ingestion and retrieval of RDF from FluidDB. It is still very rudimentary and may not totally map properly all possible RDF, but it is a working proof-of-concept implementation that it is possible to do so.

The Python code above just saves a triple. You can easy retrieve the triple by performing the following operation

import fdb,sys
if sys.version_info < (2, 6):
    import simplejson as json
else:
    import json
 
__RDF_TAG__ = 'rdf'
__RDF_TAG_PROPERTIES__  = 'rdf_properties'
__RDF_TAG_MODEL_NAME__ = 'rdf_model_name'
 
#
# Initialize the FluidDB client library
#
f = fdb.FluidDB()
#
# Retrieve the annotated objects
#
objs = f.query('has xllora/%s'%(__RDF_TAG__))
#
# Optionally you could retrieve the ones only belonging to a given model by
#
# objs = fdb.query('has xllora/%s and xllora/%s matches "%s"'%(__RDF_TAG__,__RDF_TAG_MODEL_NAME__,modelname))
#
subs = [f.get_tag_value_by_id(s,'/tags/fluiddb/about') for s in objs]
props_tmp = [f.get_tag_value_by_id(s,'/tags/xllora/'+__RDF_TAG_PROPERTIES__) for s in objs]
props = [json.loads(s[1]) if s[0]==200 else {} for s in props_tmp]

Now subs contains all the subject URIs for the predicates, and props all the dictionaries containing the properties.

The bottom line

OK. So, what is this mapping important? Basically, it will allow collaborative tagging of the created objects (subjects), allowing a collaborative and social gathering of information, besides them mapped RDF. So, what does it all means?

It basically means, that if you do not have the need to ingest RDF (where property URIs are not directly map and you need to Fluidify/reify), any data stored in FluidDB is already on some form of triplified RDF. Let me explain what I mean by that. Each FluidDB has a unique URI (e.g. http://fluidDB.fluidinfo.com/objects/4fdf7ff4-f0da-4441-8e63-9b98ed26fc12). Each tag is also uniquely identified by an URI (e.g. http://fluidDB.fluidinfo.com/tags/xllora/rdf_model_name). And finally each pair object/tag may have a value (e.g. a literal 'test_dummy' or maybe another URI http://fluidDB.fluidinfo.com/objects/a0dda173-9ee0-4799-a507-8710045d2b07). If a object/tag does not have a value you can just point it to the no value URI (or some other convention you like).

Having said that, now you have all the pieces to express FluidDB data in plain shareable RDF. That would mean basically get all the tags for and object, query the values, and then just generate and RDF model by adding the gathered triples. That’s easy. Also, if you align your properties to tags, the ingestion would also become that trivial. I will try to get that piece into Liquid as soon as other issues allow me to do so .

Just to close, I would mention once again a key element of this picture. FluidDB opens the door to a truly cooperative, distributed, and online fluid semantic web. It is one of the first examples of how annotations (a.k.a. metadata) can be easily gathered and used on the “cloud” for the masses. Great job guys!

Category: Data-Intensive Computing

Parallel and Distributed Computational Intelligence book is out for pre-order

Meandre 2.0 Alpha Preview = Scala + MongoDB

Scaling eCGA Model Building via Data-Intensive Computing