Meandre 2.0 Alpha Preview = Scala + MongoDB

[Embedded slide deck]
A lot of water has gone under the bridge since the first release of the Meandre 1.4.X series. In January I went back to the drawing board and started sketching what was going to be the 1.5.X series. The slide deck embedded above is an extended list of the thoughts during that process. As usual, I started collecting feedback from people using 1.4.X in production: things that worked, things that needed improvement, things that were just plain overcomplicated. The hot recurring topics raised by people using 1.4.X can be summarized as:

  • Complex execution concurrency model based on traditional semaphores written in Java (mostly my maintenance nightmare when changes need to be introduced)
  • Server performance bounded by JENA‘s persistent model implementation
  • State caching on individual servers to boost performance increases complexity of single-image cluster deployments
  • Cloud-deployable infrastructure, but not cloud-friendly infrastructure

As I mentioned, these elements were the main ingredients to target for the 1.5.X series. However, as the redesign moved forward, the new version represented a radical disruption from the 1.4.X series and eventually turned out to become the 2.0 Alpha version described here. The main changes that forced this transition are:

  • Cloud-friendly infrastructure required rethinking of the core functionalities
  • Drastic redesign of the back-end state storage
  • Revisited flow execution engine to support distributed flow execution
  • Changes to the API that render the returned JSON documents incompatible with 1.4.X

Meandre 2.0 (currently already available in the SVN trunk) has been rewritten from scratch using Scala. That decision was motivated by the Actor model provided by Scala (modeled after Erlang's actors). Such a model greatly simplifies the mechanics of the infrastructure, and it also powers the basis of Snowfield (the effort to create a scalable distributed flow execution engine for Meandre flows). Also, the expressiveness of the Scala language has greatly reduced the code base size (the 2.0 code base is roughly 1/3 of the size of the 1.4.X series), simplifying the maintenance activities the infrastructure will require as we move forward.
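
To give a flavor of the programming model, the snippet below is a minimal sketch in the scala.actors style; it is not actual Meandre or Snowfield code, and the message types are made up for illustration.

import scala.actors.Actor._

// Hypothetical messages; Meandre's real protocol is not shown here
case class Data(payload: String)
case object Stop

// A consumer actor that processes its mailbox sequentially, with no explicit locks
val consumer = actor {
  loop {
    react {
      case Data(p) => println("processing " + p)
      case Stop    => exit()
    }
  }
}

consumer ! Data("hello")   // asynchronous, non-blocking send
consumer ! Stop

Because each actor owns its own mailbox and state, most of the semaphore bookkeeping that made the 1.4.X concurrency model a maintenance nightmare simply disappears.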

The second big change that pushed the 2.0 Alpha trigger was the redesign of the back-end state storage. The 1.4.X series heavily relied on the relational storage for persistent RDF models provided by JENA. For performance reasons, JENA caches the model in memory and mostly assumes ownership of the model. Hence, if you want to provide a single-image Meandre cluster, you need to inject cache coherence mechanics into JENA, greatly increasing the complexity. Also, the relational implementation relies on mapping a model into a table and a triple into a row (this is a bit of a simplification). That implies that a large number of SQL statements need to be generated to update models, heavily taxing the relational storage when changes to user repository data need to be introduced.

An ideal cloud-friendly Meandre infrastructure should not maintain state (neither voluntarily nor as a result of the JENA back end). Thus, a fast and scalable back-end storage would allow infrastructure servers to maintain no state and still provide the appearance of a single-image cluster. After testing different alternatives, their community support, and their development roadmaps, the only option left was MongoDB. Its setup simplicity for small installations and its ability to easily scale to large installations (including cloud-deployed ones) made MongoDB the candidate to maintain state for Meandre 2.0. This was quite a departure from the 1.4.X series, where you had the choice to store state via JENA on an embedded Derby or an external MySQL server.
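
As a rough illustration of the stateless-server idea, the sketch below uses the MongoDB Java driver from Scala; the database, collection, and field names are made up for illustration and are not Meandre 2.0's actual schema.

import com.mongodb.{Mongo, BasicDBObject}

// Every server talks to the same MongoDB back end instead of caching state locally
val conn  = new Mongo("localhost", 27017)      // or a replica set / cloud deployment
val repos = conn.getDB("meandre").getCollection("repository")

// Store a descriptor keyed by user (hypothetical fields)
repos.insert(new BasicDBObject("user", "xllora").append("flow", "pull-test"))

// Any other server in the cluster sees the same state immediately
println(repos.findOne(new BasicDBObject("user", "xllora")))

Because every server reads and writes the same back end, any of them can answer any request, which is what makes the single-image cluster illusion cheap to provide.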

A final note on the building blocks that made the 2.0 series possible. Two other side projects were started to support the development of what will become the Meandre 2.0.X series:

  1. Crochet: Crochet aims to help quickly prototype REST APIs by relying on the flexibility of the Scala language. The initial ideas for Crochet were inspired by Gabriele Renzi's post on creating a picoframework with Scala (see http://www.riffraff.info/2009/4/11/step-a-scala-web-picoframework) and by the need to quickly prototype APIs for pilot projects. Crochet also provides mechanisms to hide the repetitive tasks involved with default responses and authentication/authorization, piggybacking on the mechanics provided by application servers.
  2. Snare: Snare is a coordination layer for distributed applications written in Scala that relies on MongoDB to implement its communication layer. Snare implements a basic heartbeat system and a simple notification mechanism (peer-to-peer and broadcast communication), using MongoDB to track heartbeats and notification mailboxes (a sketch of the heartbeat idea follows this list).
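
The heartbeat idea can be sketched in a few lines. The snippet below is only an illustration of keeping a heartbeat document in MongoDB; it is not Snare's actual API, and all names are hypothetical.

import com.mongodb.{Mongo, BasicDBObject}

// Hypothetical MongoDB-backed heartbeat, in the spirit of what Snare tracks
val beats = new Mongo("localhost", 27017).getDB("snare").getCollection("heartbeat")

def beat(node: String) {
  val query  = new BasicDBObject("node", node)
  val update = new BasicDBObject("node", node).append("ts", System.currentTimeMillis)
  // Upsert: create the document the first time, refresh the timestamp afterwards
  beats.update(query, update, true, false)
}

beat("node-1")   // peers can poll the collection and flag stale timestamps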

Soaring the Clouds with Meandre


You may find the slide deck and the abstract for the presentation we delivered today at the “Data-Intensive Research: how should we improve our ability to use data” workshop in Edinburgh.

Abstract

This talk will focus on a highly scalable data-intensive infrastructure being developed at the National Center for Supercomputing Applications (NCSA) at the University of Illinois and will introduce current research efforts to tackle the challenges presented by big data. Research efforts include exploring potential ways of integration between cloud computing concepts—such as Hadoop or Meandre—and traditional HPC technologies and assets. These architecture models contrast significantly, but can be leveraged by building cloud conduits that connect these resources to provide even greater flexibility and scalability on demand. Orchestrating the physical computational environment requires innovative and sophisticated software infrastructure that can transparently take advantage of the functional features and negotiate the constraints imposed by this diversity of computational resources. Research conducted during the development of the Meandre infrastructure has led to the production of an agile conductor able to leverage the particular advantages in this physical diversity. It can also be implemented as services and/or in the context of another application, benefiting from its reusability, flexibility, and high scalability. Some example applications and an introduction to the data-intensive infrastructure architecture will be presented to provide an overview of the diverse scope of Meandre usages. Finally, a case will be presented showing how software developers and system designers can easily transition to these new paradigms to address the primary data-deluge challenges and to soar to new heights with extreme application scalability using cloud computing concepts.

Related posts:

  1. Meandre: Semantic-Driven Data-Intensive Flows in the Clouds
  2. Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study using Meandre
  3. [BDCSG2008] Clouds and ManyCores: The Revolution (Dan Reed)

Fast REST API prototyping with Crochet and Scala


I just finished committing the last changes to Crochet and tagged version 0.1.4vcli, now publicly available on GitHub (http://github.com/xllora/Crochet). Also, feel free to visit the issues page in case you run into questions/problems/bugs.

Motivation

Crochet is a lightweight web framework oriented toward rapid prototyping of REST APIs. If you are looking for a Rails-like framework written in Scala, please take a look at Lift at http://liftweb.net/ instead.

Crochet targets quick prototyping of REST APIs, relying on the flexibility of the Scala language. The initial ideas for Crochet were inspired by reading Gabriele Renzi's post on creating the STEP picoframework with Scala and by the need to quickly prototype APIs for pilot projects. Crochet also provides mechanisms to hide the repetitive tasks involved with default responses and authentication/authorization, piggybacking on the mechanics provided by application servers.

Who uses Crochet?

Crochet was born from the need to quickly prototype REST APIs that expose legacy code written in Java. I have been actively using Crochet to provide REST APIs for a variety of projects developed at the National Center for Supercomputing Applications. One of the primary adopters and movers of Crochet is the Meandre Infrastructure for data-intensive computing developed under the SEASR project.

Crochet in 2 minutes

Before you start, please check that you have Scala installed on your system. You can find more information on how to get Scala up and running here.

  1. Get the latest Crochet jar and the third-party dependencies from the Downloads section at GitHub.
  2. Copy the following code into a file named hello-world.scala.
    import crochet._
    new Crochet {
         get("/message") { 
             <html>
                   <head><title>Hello World</title></head>
                   <body><h1>Hello World!</h1></body>
             </html>
         }
    } on 8080
  3. Get your server up and running with the following command (please change the version number if needed):
    $ scala -cp crochet-0.1.4.jar:crochet-3dparty-libraries-0.1.X.jar hello-world.scala
  4. You now have your first Crochet API up and running. You can check that it works by pointing your browser to http://localhost:8080/message; you should get the message Hello World! back.

Where to go from here?

You will find more information on the Crochet wiki at GitHub. The wiki contains basic information such as a QuickStart guide (which also includes how to deal with static content), descriptions of the basic concepts used in Crochet, and several examples that can get you up and running fast.

Related posts:

  1. Meandre 2.0 Alpha Preview = Scala + MongoDB
  2. Meandre is going Scala
  3. Fast mutation implementation for genetic algorithms in Python

Meandre is going Scala


After quite a bit of experimenting with different alternatives, Meandre is moving to Scala. Scala is a general-purpose programming language designed to express common programming patterns in a concise, elegant, and type-safe way. This is not a radical process, but a gradual one as I start to revisit the infrastructure for the next major release. Scala also generates code for the JVM, making mixing and matching with Java trivial. I started futzing around with Scala back when I began the development of Meandre during the summer of 2007; however, I fell back to Java since that was what most of the people in the group were comfortable with. I was fascinated with Scala's fusion of object-oriented programming and functional programming. Time went by, and the codebase has grown to a point where I cannot stand cutting through the weeds of Java anymore when I have to extend the infrastructure or do bug fixing—not to mention its verbosity even for writing trivial code.

This summer I decided to go on a quest to get me out of the woods. I do not mind relying on the JVM and the large collection of libraries available, but I would also like to get my sanity back. Yes, I tested some of the usual suspects for the JVM (Jython, JRuby, Clojure, and Groovy), but none was quite what I wanted. For instance, I wrote most of the Meandre infrastructure services using Jython (much more concise than Java), but I was still not quite happy to jump on that boat. Clojure is also interesting (functional programming), but it would be hard to justify the group moving to it since not everybody may feel comfortable with a pure functional language. I also toyed with some not-so-usual ones like Erlang and Haskell, but again, I ended up with no real argument that could justify such a decision.

So, as I started doing back in 2007, I went back to my original idea of using Scala and its mixed object-oriented and functional-programming paradigm. To test it seriously, I started developing the distributed execution engine for Meandre in Scala using its Erlang-inspired actors. And, boom, suddenly I found myself spending more time thinking than writing/debugging threaded/networking code :D . Yes, I regret not running with my original intuition back in 2007, but better late than never. With a working seed of the distributed engine written and tested (did I mention that ScalaCheck and specs are really powerful tools for behavior-driven development?), I finally decided to start gravitating the Meandre infrastructure development effort from Java to Scala—did I mention that Scala is Martin Odersky's child? Yes, such a decision has some impact on my colleagues, but I envision that the benefits will eventually outweigh the initial resistance and steep learning curve. At least, during the last two group meetings nobody jumped out of the window while I was presenting the key elements of Scala and demonstrating how concise and elegant it made the first working seed of the distributed execution engine :D . We even got into discussions about the benefits of using Scala if it delivered everything I showed. I am lucky to work with such smart guys. You can take a peek at the distributed execution engine (a.k.a. Snowfield) at SEASR's Fisheye.
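
As an aside on the property-based testing style mentioned above, here is a minimal ScalaCheck property; it is a generic example, not taken from Snowfield's actual test suite.

import org.scalacheck.Prop.forAll
import org.scalacheck.Properties

// A property that ScalaCheck tries to falsify with randomly generated lists
object ReverseSpec extends Properties("List") {
  property("double reverse is identity") = forAll { (xs: List[Int]) =>
    xs.reverse.reverse == xs
  }
}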

Oh, one last thing. Are you using Atlassian's Fisheye? Do you want syntax highlighting for Scala? I tweaked the Java definitions to make it highlight Scala code. Remember to drop the scala.def file in the $FISHEYE_HOME/syntax directory and add an entry to filename.map to make it highlight anything with the extension .scala.

Related posts:

  1. Fast REST API prototyping with Crochet and Scala
  2. Meandre: Semantic-Driven Data-Intensive Flow Engine
  3. Meandre Infrastructure 1.4 RC1 tagged

Temporary storage for Meandre’s distributed flow execution


Designing the distributed execution of a generic Meandre flow involves several moving pieces. One of those is the temporary storage required by the computing nodes (think of one node as one isolated component of a flow) to keep up with the data generated by a component, and also to be able to replicate such storage to the node containing the consumer to be fed. Such storage, local to each node, must guarantee at least three basic properties.

  • Transaction ready
  • Lightweight implementation
  • Efficient writes and reads to minimize contention on ports

Also, it is important to keep in mind that in a distributed execution scenario, each node requires its own separate and standalone storage system. Thus, it is also important to minimize the overhead of installing and maintaining such a storage subsystem. There are several alternatives available, ranging from traditional relational database systems to home-brewed solutions. Relational database systems provide a distributed, reliable, stable, and well-tested environment, but they tend to require a quite involved installation and maintenance. Also, tuning those systems to optimize performance may require quite involved monitoring and tweaking. On the other hand, home-brewed solutions can be optimized for performance by dropping non-required functionality and focusing on write and read performance. However, such solutions tend to be bug prone and time consuming, not to mention that proving transaction correctness can be quite involved.

Fortunately, there is a middle ground where efficient and stable transaction-aware solutions are available. They may not provide SQL interfaces, but they still provide transaction boundaries. Also, since they are oriented toward maximizing performance, they can provide better throughput and operation latency than having to traverse the SQL stack. Examples of such storage systems can be found in the areas of key-value stores and column stores. Several options were considered while writing these lines, and key-value stores were the ones that best matched the three requirements described above. Several options were informally tested, including solutions like HDF and Berkeley DB; however, the best performing by far under stress-test conditions similar to the sketched temporary storage subsystem was Tokyo Cabinet. I already introduced and tested Tokyo Cabinet more than a year ago, but this time I was going to give it a stress test to convince myself that it was what I wanted to use as the temporary storage for distributed flow execution.

The experiment

Tokyo Cabinet is a collection of storage utilities including, among other facilities, key-value stores implemented as hash files or B-trees, and flexible column stores. To implement multiple queues on a single casket (the Tokyo Cabinet file containing the data store), B-trees with duplicated keys can help achieve that goal. The duplicated keys are the queue names, and the values are the UUIDs of the objects being stored. Objects are also stored in the same B-tree by using the UUID as the key, and the value becomes the payload to store (usually an array of bytes). The experiment below illustrates the performance and throughput you can achieve.

Previously, I had been heavily using the Python bindings to test Tokyo Cabinet, but this time I went down the Java route (since the Meandre infrastructure is written in Java). The Java bindings are basically built around JNI and statically linked to the C version of the Tokyo Cabinet library, giving you the best of both worlds. To measure how fast I can write data out of a port into the local storage in transactional mode, I used the following piece of code.

// Imports and scaffolding (class wrapper, constants, and a fail helper) added so the
// snippet compiles standalone; the original post only showed main(), and the constant
// values below are assumed for illustration.
import java.io.File;
import java.util.UUID;
 
import tokyocabinet.BDB;
 
public class BDBQueueBenchmark {
 
	static final String TEST_CASKET_TCB = "test-casket.tcb"; // casket file (assumed name)
	static final String QUEUE_KEY = "queue";                  // duplicated key naming the queue (assumed)
 
	// Stand-in for JUnit's fail() used in the original test code
	static void fail ( String msg ) { throw new RuntimeException(msg); }
 
	public static void main ( String args [] ) {
		int MAX = 10000000;        // largest batch size: 10^7 inserts in a single transaction
		int inc = 10;              // batch size grows by a factor of 10 each round
		int cnt = 0;
		float fa [] = new float[8];
		int reps = 10;             // repetitions per batch size to average out noise
 
		for ( int i=1 ; i<=MAX ; i*=inc  ) {
			for ( int j=0 ; j<reps ; j++ ) {
 
				// Open the B-tree database (casket), creating it if needed
				BDB bdb = new BDB();
 
				if(!bdb.open(TEST_CASKET_TCB, BDB.OWRITER | BDB.OCREAT | BDB.OTSYNC )){
					int ecode = bdb.ecode();
					fail("open error: " + bdb.errmsg(ecode));
				}
 
				// Add a bunch of duplicates in a single transaction: the queue key points
				// to each UUID, and each UUID maps to its payload
				long start = System.currentTimeMillis();
				bdb.tranbegin();
				for ( int k=0; k<i; k++ ) {
					String uuid = UUID.randomUUID().toString();
					bdb.putdup(QUEUE_KEY, uuid);
					bdb.putdup(uuid.getBytes(), uuid.getBytes());
				}
				bdb.trancommit();
				fa[cnt] += System.currentTimeMillis()-start;
 
				// Clean up
				bdb.close();
				new File(TEST_CASKET_TCB).delete();
			}
			fa[cnt] /= reps;
			// Print batch size, average time (ms), and average time per inserted item
			System.out.println(""+i+"\t"+fa[cnt]+"\t"+(fa[cnt]/i));
			cnt++;
		}
	}
}

The idea is very simple: store 1, 10, 100, 1000, 10000, 100000, 1000000, and 10000000 pieces of data at once in a single transaction, and measure the time. For each data volume, repeat the operation 10 times and average the time, trying to mitigate the fact that the experiment was run on a laptop running all sorts of other concurrent applications. Plot the results to illustrate:

  1. the time required to insert one piece of data as a function of the number of data items involved in the transaction
  2. the number of pieces of data written per second as a function of the number of data items involved in the transaction

The idea is to expose the behavior of Tokyo Cabinet as more data is involved in a transaction, to check whether degradation happens as the volume increases. This is an important issue, since data-intensive flows can generate large volumes of data per firing event.

The results

Results are displayed on the figures below.

[Figure: Time per data unit as a function of the number of data items involved in a transaction]
[Figure: Throughput as a function of the number of data items in a transaction]

The first important element to highlight is that the time to insert one data element does not degrade as the volume increases. Actually, it is quite interesting that Tokyo Cabinet feels more comfortable as the volume per transaction grows. The throughput results are also interesting, since they show that it is able to sustain transfers of around 40K data units per second, and that the only bottleneck is the disk cache management and the bandwidth to the disk itself—which gets saturated after pushing more than 10K pieces of data.

The lessons learned

Tokyo Cabinet is an excellent candidate to support the temporary transactional storage required for the distributed execution of a Meandre flow. Other alternatives like MySQL, embedded Apache Derby, the Java edition of Berkeley DB, and SQLite JDBC could not even get close to such performance, falling at least one order of magnitude behind.

Related posts:

  1. Easy, reliable, and flexible storage for Python
  2. ZooKeeper and distributed applications
  3. Meandre: Semantic-Driven Data-Intensive Flow Engine

Liquid: RDF endpoint for FluidDB


A while ago I wrote some thoughts about how to map RDF to and from FluidDB. There I explored how you could map RDF onto FluidDB, and how to get it back. That got me thinking about how to get a simple endpoint you could query for RDF. If I could pull FluidDB data as RDF, I could get all the flexibility of SPARQL for free. With this idea in my mind, I just went and grabbed Meandre and the JFluidDB library started by Ross Jones, and built a few components.

The main goal was to be able to get an object, list its tags, and express the result in RDF. FluidDB helps the mapping since objects are uniquely identified by URIs. For instance, the unique object 5ff74371-455b-4299-83f9-ba13ae898ad1 (FluidDB relies on UUID version four with the form xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx) is uniquely identified by http://sandbox.fluidinfo.com/objects/5ff74371-455b-4299-83f9-ba13ae898ad1 (or a URL of the form http://sandbox.fluidinfo.com/objects/xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx) in case you are using the sandbox, or http://fluiddb.fluidinfo.com/objects/5ff74371-455b-4299-83f9-ba13ae898ad1 if you are using the main instance. The same goes for tags. The tag fluiddb/about can be uniquely identified by the URI http://sandbox.fluidinfo.com/tags/fluiddb/about, or http://fluiddb.fluidinfo.com/tags/fluiddb/about.

A simple RDF description for an object

Once you get the object back, the basic translated RDF version for object a10ab0f3-ef56-4fc0-a8fa-4d452d8ab1db should look like the listing below in TURTLE notation.

<http://sandbox.fluidinfo.com/objects/a10ab0f3-ef56-4fc0-a8fa-4d452d8ab1db>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute fluiddb/default/tags/permission/update/policy"^^<http://www.w3.org/2001/XMLSchema#string> .

I will break the above example into the three main pieces involved (the id, the about, and the tags). The basic construct is simple. First, a triple marks the object as a FluidDB object.

<http://sandbox.fluidinfo.com/objects/a10ab0f3-ef56-4fc0-a8fa-4d452d8ab1db>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/>   
.

Then, if the object has an about associated on creation, another triple gets generated and added, as shown below. To be consistent, I suggest reusing the DC description property, since that is what the about of an object tends to indicate.

<http://sandbox.fluidinfo.com/objects/a10ab0f3-ef56-4fc0-a8fa-4d452d8ab1db>
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute fluiddb/default/tags/permission/update/policy"^^<http://www.w3.org/2001/XMLSchema#string> 
.

Finally, if there are tags associated with the object, a bag gets created, and all the URIs describing the tags get pushed into the bag as shown below.

<http://sandbox.fluidinfo.com/objects/a10ab0f3-ef56-4fc0-a8fa-4d452d8ab1db>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description>
.

Creating an RDF endpoint

Armed with the above, the rest should be easy: just allow querying for objects, then collect the object information, and finally generate the resulting RDF. Using Meandre and JFluidDB, I wrote a few components that allow the simple creation of such an endpoint, as illustrated by the picture below.

[Figure: Meandre FluidDB RDF endpoint flow]

The basic mechanism is simple. Just push the query into the Query for objects component. This component will stream each of the UUIDs of the matched objects to Read object, which pulls the object information. Then the object is passed to Object to RDF model, which generates the RDF snippet shown in the example above for each of the objects pushed. All the RDF fragments are then reduced together by the Wrapped models reducer component, and the resulting RDF model gets serialized into text using the Turtle notation. Finally, the serialized text is printed to the console. The equivalent code can be expressed as a ZigZag script as follows:

#
# Imports eliminated for clarity
#

#
# Create the component aliases
#
alias  as OBJECT_TO_RDF
alias  as PRINT_OBJECT
alias  as QUERY_FOR_OBJECTS
alias  as READS_THE_REQUESTED_OBJECT
alias  as WRAPPED_MODELS_REDUCER
alias  as MODEL_TO_RDF_TEXT
alias  as PUSH_STRING

#
# Create the component instances
#
push_query_string = PUSH_STRING()
wrapped_models_reducer = WRAPPED_MODELS_REDUCER()
query_for_objects = QUERY_FOR_OBJECTS()
reads_object = READS_THE_REQUESTED_OBJECT()
model_to_rdf_text = MODEL_TO_RDF_TEXT()
print_rdf_text = PRINT_OBJECT()
object_to_rdf_model = OBJECT_TO_RDF()

#
# Set component properties
#
push_query_string.message = "has fluiddb/tag/path"
query_for_objects.fluiddb_url = "http://sandbox.fluidinfo.com"
reads_object.fluiddb_url = "http://sandbox.fluidinfo.com"
model_to_rdf_text.rdf_dialect = "TTL"

#
# Create the flow by connecting the components
#
@query_for_objects_outputs = query_for_objects()
@model_to_rdf_text_outputs = model_to_rdf_text()
@push_query_string_outputs = push_query_string()
@object_to_rdf_model_outputs = object_to_rdf_model()
@reads_object_outputs = reads_object()
@wrapped_models_reducer_outputs = wrapped_models_reducer()

query_for_objects(text: push_query_string_outputs.text)
model_to_rdf_text(model: wrapped_models_reducer_outputs.model)
object_to_rdf_model(object: reads_object_outputs.object)
reads_object(uuid: query_for_objects_outputs.uuid)[+200!]
print_rdf_text(object: model_to_rdf_text_outputs.text)
wrapped_models_reducer(model: object_to_rdf_model_outputs.model)

The only interesting element in the script is the [+200!] entry that creates 200 parallel copies of read object that will concurrently hit FluidDB to pull the data, trying to minimize the latency. The script could be compiled into a MAU and run. The output of the execution would look like the following:

$ java -jar zzre-1.4.7.jar pull-test.mau 
Meandre MAU Executor [1.0.1vcli/1.4.7]
All rights reserved by DITA, NCSA, UofI (2007-2009)
THIS SOFTWARE IS PROVIDED UNDER University of Illinois/NCSA OPEN SOURCE LICENSE.
 
Executing MAU file pull-test.mau
Creating temp dir pull-test.mau.run
Creating temp dir pull-test.mau.public_resources
 
Preparing flow: meandre://seasr.org/zigzag/1253813636945/4416962494019783033/flow/pull-test-mau/
2009-09-24 12:34:38.480::INFO:  jetty-6.1.x
2009-09-24 12:34:38.495::INFO:  Started SocketConnector@0.0.0.0:1715
Preparation completed correctly
 
Execution started at: 2009-09-24T12:34:38
----------------------------------------------------------------------------
<http://sandbox.fluidinfo.com/objects/a24b4a18-5483-47c6-9b62-0955210c7ebd>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute test/Net::FluidDB-name-1253772095.82845-0.944567286499904"^^<http://www.w3.org/2001/XMLSchema#string> .
 
<http://sandbox.fluidinfo.com/objects/5ff74371-455b-4299-83f9-ba13ae898ad1>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute test/Net::FluidDB-name-1253622685.3231461-0.437099602163897316"^^<http://www.w3.org/2001/XMLSchema#string> .
 
<http://sandbox.fluidinfo.com/objects/67e52346-527e-4bb7-b8f3-05fa8a8ae35b>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute test/Net::FluidDB-name-1253620190.69175-0.861614257420541"^^<http://www.w3.org/2001/XMLSchema#string> .
 
<http://sandbox.fluidinfo.com/objects/8a65a184-03d9-4881-95df-02fa0561a86f>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute fluiddb/namespaces/permission/update/exceptions"^^<http://www.w3.org/2001/XMLSchema#string> .
 
<http://sandbox.fluidinfo.com/objects/335b44e9-a72f-479d-ad60-3661a35231ba>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute test/Net::FluidDB-name-1253776141.95577-0.284175700598524"^^<http://www.w3.org/2001/XMLSchema#string> .
 
<http://sandbox.fluidinfo.com/objects/3bbf1cc6-731c-4e56-a664-adeb5484334f>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute fluiddb/namespaces/permission/delete/policy"^^<http://www.w3.org/2001/XMLSchema#string> .
 
<http://sandbox.fluidinfo.com/objects/aba5adcf-fd44-40ab-b702-9cc635650bc3>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute test/Net::FluidDB-name-1253614713.757-0.604769721717702"^^<http://www.w3.org/2001/XMLSchema#string> .
 
<http://sandbox.fluidinfo.com/objects/f61ceb3b-33df-4356-8e7d-c56d3d0ae338>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute test/Net::FluidDB-name-1253615887.80879-0.0437609496034099"^^<http://www.w3.org/2001/XMLSchema#string> .
 
...

That’s it! A first RDF dump of the query!

The not so great news

The current FluidDB API does not provide any method to pull data from more than one object at once. That basically means that for each UUID a call to the server needs to be processed, which is a huge latency overhead. The FluidDB guys know about it, and they are scratching their heads on how to provide a “multi get”. A full trace of the output can be found in this FluidDB RDF endpoint trace.

This element is crucial for any RDF endpoint. Above I left out a basic element: the time measurements. That part looks like:

Flow execution statistics

Flow unique execution ID : meandre://seasr.org/zigzag/1253813636945/4416962494019783033/flow/pull-test-mau/8D8E354A/1253813678323/1493255769/
Flow state               : ended
Started at               : Thu Sep 24 12:34:38 CDT 2009
Last update              : Thu Sep 24 12:37:28 CDT 2009
Total run time (ms)      : 170144

Basically, 170 seconds to pull only 238 objects (roughly 715 ms per object), where all the time is spent round-tripping to FluidDB.

Getting there

This basically means that such high latency would not allow efficient interactive usage of the endpoint. However, this exercise was useful to prove that simple RDF endpoints for FluidDB are possible and would greatly boost the flexibility of interaction with FluidDB. The current form of the endpoint may still have value if you are not in a hurry, allowing you to run SPARQL queries against FluidDB data and get the best of both worlds.

The code used

If you are interested in running the code, you will need Meandre and the components I put together for the experiment, which you can get from http://github.com/xllora/liquid.

Related posts:

  1. Liquid: RDF meandering in FluidDB
  2. Temporary storage for Meandre’s distributed flow execution
  3. Efficient serialization for Java (and beyond)

From Galapagos to Twitter: Darwin, Natural Selection, and Web 2.0


Yesterday I was visiting Monmouth College to participate in the Darwinpalooza, which commemorates the 200th anniversary of Charles Darwin's birth and the 150th anniversary of the publication of On the Origin of Species. After scratching my head about what to present, I came up with quite a mix. You will find the abstract of the talk below, as well as the slides I used.

Abstract: One hundred and fifty years have passed since the publication of Darwin's world-changing manuscript “On the Origin of Species by Means of Natural Selection”. Darwin's ideas have proven their power to reach beyond the biology realm, and their ability to define a conceptual framework which allows us to model and understand complex systems. In the mid 1950s and 60s the efforts of a scattered group of engineers proved the benefits of adopting an evolutionary paradigm to solve complex real-world problems. In the 70s, the emerging presence of computers brought us a new collection of artificial evolution paradigms, among which genetic algorithms rapidly gained widespread adoption. Currently, the Internet has propitiated an exponential growth of information and computational resources that is clearly disrupting our perception and forcing us to reevaluate the boundaries between technology and social interaction. Darwin's ideas can, once again, help us understand such disruptive change. In this talk, I will review the origin of artificial evolution ideas and techniques. I will also show how these techniques are nowadays helping to solve a wide range of applications, from life science problems to Twitter puzzles, and how high performance computing can make Darwin's ideas a routine tool to help us model and understand complex systems.

Related posts:

  1. Challenging lectures on-line at TED
  2. Dusting my Ph.D. thesis off
  3. Scaling Genetic Algorithms using MapReduce

Liquid: RDF meandering in FluidDB


Meandre (the NCSA-pushed data-intensive computing infrastructure) relies on RDF to describe components, flows, locations, and repositories. RDF has become the central piece that makes Meandre's flexibility and reusability possible. However, one piece still remains largely sketchy and has no clear optimal solution: how can we make it easy for anybody to share, publish, and annotate flows, components, locations, and repositories? More importantly, how can that be done in the cloud in an open-ended fashion, allowing anybody to annotate and comment on each of the aforementioned pieces?

The FluidDB trip

During my last summer trip to Europe, Terry Jones (CEO) invited me to visit FluidInfo (based in Barcelona), where I also met Esteve Fernandez (CTO). I had a great opportunity to chat with the masterminds behind an intriguing concept I ran into after a short note I received from David E. Goldberg. FluidDB, the main product being pushed by FluidInfo, is an online collaborative “cloud” database. In FluidInfo's words:

FluidDB lets data be social. It allows almost unlimited information personalization by individual users and applications, and also between them. This makes it simple to build a wide variety of applications that benefit from cooperation, and which are open to unanticipated future enhancements. Even more importantly, FluidDB facilitates and encourages the growth of applications that leave users in control of their own data.

FluidDB went live on a private alpha last week. The basic concept behind the scenes is simple. FluidDB stores objects. Objects do not belong to anybody. Objects may be “blank” or they may be about something (e.g. http://seasr.org/meandre). You can create as many blank objects as you want. Creating an object with the same about always returns the same object (thus, there will only be one object about http://seasr.org/meandre). Once objects exist, things start getting more interesting: you can tag any object with whatever tag you want. For instance, I could tag the http://seasr.org/meandre object with a hosted_by tag, and assign the tag a value. FluidDB introduces one last trick: namespaces. For instance, I got the xllora namespace; that means that the tag I mentioned above would look like /tag/xllora/hosted_by. You can create as many nested namespaces under your main namespace as you want. FluidDB also provides mechanisms to control who can query and see the values of your created tags.

As you can see, the basic object model and mechanics are very simple. When the alpha went live, FluidDB only provided access via a simple REST-like HTTP API. Within a few days, a blossoming collection of client libraries wrapping that API was developed by a dynamic community that gathers on the #fluiddb channel on irc.freenode.net.

You were saying something about RDF

Back to the point. One thing I chatted about with the FluidDB guys was what they thought about the similarities between FluidDB's object model and RDF. After playing with RDF for a while, the FluidDB model looked awfully familiar, despite being a much simpler and more manageable model than RDF. They did not have much to say about it, and the question got stuck in the back of my mind. So when I got access to the private alpha, I could not help but go down the path of figuring out what it would mean to map RDF onto FluidDB. Yes, the simple straight answer would be to stick serialized RDF into the value of a given tag (e.g. xllora/rdf). However, that option seemed poor, since I could not exploit the social aspect of collaborative annotations provided by FluidDB. So, back to the drawing board. What do both models have in common? They are both descriptions of something. In RDF you can see those as the subjects of the triple predicates, whereas in FluidDB those are simply objects. RDF uses properties to qualify objects; FluidDB uses tags. Both let you attach values to qualified objects. Mmh, there you go.

With this idea in mind, I started Liquid, a simple proof-of-concept library that maps RDF onto FluidDB and then gets it back. There was only one thing that needed a bit of patching: RDF properties are arbitrary URIs. Those could not be easily mapped on top of FluidDB tags, so I took a simple compromise route.

  • RDF subject URIs are mapped onto FluidDB qualified objects via the about tag
  • One FluidDB tag will contain all the properties for that object (basically a simple dictionary encoded in JSON)
  • References to other RDF URIs will be mapped onto FluidDB object URIs, and vice versa

Let’s make it a bit more chewable with a simple example.

<?xml version="1.0"?>
 
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:cd="http://www.recshop.fake/cd#">
 
<rdf:Description
rdf:about="http://www.recshop.fake/cd/Empire Burlesque">
  <cd:artist>Bob Dylan</cd:artist>
 </rdf:Description>
 
</rdf:RDF>

The above RDF represents a single triple

http://www.recshop.fake/cd/Empire Burlesque	http://www.recshop.fake/cd#artist	   "Bob Dylan"

This triple can be mapped onto FluidDB by creating one qualified FluidDB object and adding the proper tags. The example below shows how to do so using Python's fdb.py client library by Nicholas J. Radcliffe.

import fdb,sys
if sys.version_info < (2, 6):
    import simplejson as json
else:
    import json
 
__RDF_TAG__ = 'rdf'
__RDF_TAG_PROPERTIES__  = 'rdf_properties'
__RDF_TAG_MODEL_NAME__ = 'rdf_model_name'
 
#
# Initialize the FluidDB client library
#
f = fdb.FluidDB()
#
# Create the tags (if they exist, this won't hurt)
#
f.create_abstract_tag(__RDF_TAG__)
f.create_abstract_tag(__RDF_TAG_PROPERTIES__)
f.create_abstract_tag(__RDF_TAG_MODEL_NAME__)
#
# Create the subject object of the triple
#	
o = f.create_object('http://www.recshop.fake/cd/Empire Burlesque')
#
# Map RDF properties
#
properties = {'http://www.recshop.fake/cd#artist':['Bob Dylan']}
#
# Tag the object as RDF aware, properties available, and to which model/named graph 
# it belongs
#
f.tag_object_by_id(o.id, __RDF_TAG__)
f.tag_object_by_id(o.id,__RDF_TAG_PROPERTIES__,value=json.dumps(properties))
f.tag_object_by_id(o.id, __RDF_TAG_MODEL_NAME__,'test_dummy')

Running along with this basic idea, I quickly stitched together a simple library (Liquid) that allows ingestion and retrieval of RDF from FluidDB. It is still very rudimentary and may not map all possible RDF properly, but it is a working proof of concept that it is possible to do so.

The Python code above just saves a triple. You can easily retrieve the triple by performing the following operation.

import fdb,sys
if sys.version_info < (2, 6):
    import simplejson as json
else:
    import json
 
__RDF_TAG__ = 'rdf'
__RDF_TAG_PROPERTIES__  = 'rdf_properties'
__RDF_TAG_MODEL_NAME__ = 'rdf_model_name'
 
#
# Initialize the FluidDB client library
#
f = fdb.FluidDB()
#
# Retrieve the annotated objects
#
objs = f.query('has xllora/%s'%(__RDF_TAG__))
#
# Optionally you could retrieve the ones only belonging to a given model by
#
# objs = fdb.query('has xllora/%s and xllora/%s matches "%s"'%(__RDF_TAG__,__RDF_TAG_MODEL_NAME__,modelname))
#
subs = [f.get_tag_value_by_id(s,'/tags/fluiddb/about') for s in objs]
props_tmp = [f.get_tag_value_by_id(s,'/tags/xllora/'+__RDF_TAG_PROPERTIES__) for s in objs]
props = [json.loads(s[1]) if s[0]==200 else {} for s in props_tmp]

Now subs contains all the subject URIs for the predicates, and props all the dictionaries containing the properties.

The bottom line

OK. So, why is this mapping important? Basically, it allows collaborative tagging of the created objects (subjects), enabling a collaborative and social gathering of information alongside the mapped RDF. So, what does it all mean?

It basically means that, if you do not need to ingest RDF (where property URIs are not directly mapped and you need to Fluidify/reify them), any data stored in FluidDB is already in some form of triplified RDF. Let me explain what I mean by that. Each FluidDB object has a unique URI (e.g. http://fluidDB.fluidinfo.com/objects/4fdf7ff4-f0da-4441-8e63-9b98ed26fc12). Each tag is also uniquely identified by a URI (e.g. http://fluidDB.fluidinfo.com/tags/xllora/rdf_model_name). And finally, each object/tag pair may have a value (e.g. a literal 'test_dummy' or maybe another URI such as http://fluidDB.fluidinfo.com/objects/a0dda173-9ee0-4799-a507-8710045d2b07). If an object/tag pair does not have a value, you can just point it to a no-value URI (or some other convention you like).

Having said that, you now have all the pieces to express FluidDB data in plain shareable RDF. That basically means getting all the tags for an object, querying the values, and then generating an RDF model by adding the gathered triples. That's easy. Also, if you align your properties to tags, the ingestion would become just as trivial. I will try to get that piece into Liquid as soon as other issues allow me to do so :D .
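
A minimal sketch of that triple-generation step, written with Jena from Scala (the object URI, tag URI, and value below are taken from the example above; this is just an illustration, not part of Liquid):

import com.hp.hpl.jena.rdf.model.ModelFactory

// Turn one FluidDB (object URI, tag URI, value) combination into an RDF statement
val model = ModelFactory.createDefaultModel
val obj   = model.createResource("http://fluidDB.fluidinfo.com/objects/4fdf7ff4-f0da-4441-8e63-9b98ed26fc12")
val tag   = model.createProperty("http://fluidDB.fluidinfo.com/tags/xllora/rdf_model_name")
obj.addProperty(tag, "test_dummy")   // literal value; use another resource when the value is a URI

model.write(System.out, "TURTLE")    // serialize the gathered triples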

Just to close, I would mention once again a key element of this picture. FluidDB opens the door to a truly cooperative, distributed, and online fluid semantic web. It is one of the first examples of how annotations (a.k.a. metadata) can be easily gathered and used on the “cloud” for the masses. Great job guys!

Related posts:

  1. Liquid: RDF endpoint for FluidDB
  2. Meandre: Semantic-Driven Data-Intensive Flows in the Clouds
  3. Meandre 1.4.0 final release candidate tagged