Temporary storage for Meandre’s distributed flow execution

Designing the distributed execution of a generic Meandre flow involves several moving pieces. One of those is the temporary storage the computing nodes require (think of each node as running one isolated component of a flow) to keep up with the data generated by a component, and also to be able to replicate such storage to the node containing the consumer to be fed. Such storage, local to each node, must guarantee at least three basic properties.

  • Transaction ready
  • Lightweight implementation
  • Efficient writes and reads to minimize contention on ports

Also, it is important to keep in mind that in a distributed execution scenario each node requires its own separate and standalone storage system, so minimizing the overhead of installing and maintaining that storage subsystem matters too. There are several alternatives available, ranging from traditional relational database systems to home-brewed solutions. Relational database systems provide a distributed, reliable, stable, and well-tested environment, but they tend to require a fairly involved installation and maintenance, and tuning them for performance may demand quite a bit of monitoring and tweaking. On the other hand, home-brewed solutions can be optimized for performance by dropping unneeded functionality and focusing on write and read speed. However, such solutions tend to be bug prone and time consuming, not to mention that proving transaction correctness can be quite involved.

Fortunately, there is a middle ground where efficient and stable transaction-aware solutions are available. They may not provide SQL interfaces, but they still provide transaction boundaries, and since they are oriented toward maximizing performance, they can deliver better throughput and operation latency than traversing the whole SQL stack. Examples of such storage systems can be found in the areas of key-value stores and column stores. Several options were considered while writing these lines, and key-value stores were the ones that best matched the three requirements described above. Several options were informally tested, including solutions like HDF and Berkeley DB; however, the best performing by far, under stress test conditions similar to those of the sketched temporary storage subsystem, was Tokyo Cabinet. I already introduced and tested Tokyo Cabinet more than a year ago, but this time I gave it a stress test to convince myself that it was what I wanted to use as the temporary storage for distributed flow execution.

The experiment

Tokyo Cabinet is a collection of storage utilities including, among other facilities, key-value stores implemented as hash files or B-trees, and flexible column stores. To implement multiple queues on a single casket (the Tokyo Cabinet file containing the data store), B-trees with duplicated keys do the trick: the duplicated keys are the queue names, and the values are the UUIDs of the objects being stored. The objects themselves are stored in the same B-tree, using the UUID as the key and the payload to store (usually an array of bytes) as the value. The benchmark below illustrates the performance and throughput you can achieve with this layout.
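
Before the benchmark, here is a minimal sketch of the queue layout itself in Python. This is my own illustration, not Meandre code: the casket and queue names are made up, and it assumes pytc's BDB mirrors the C API's putdup, getlist, and out calls (by analogy with its HDB flags and methods).

import uuid
import pytc
 
# Open a B-tree casket; the BDBOWRITER/BDBOCREAT flag names assume
# pytc mirrors the C API, as it does for HDB
bdb = pytc.BDB()
bdb.open('queues.tcb', pytc.BDBOWRITER | pytc.BDBOCREAT)
 
def enqueue(queue, payload):
    # Queue name -> UUID, one duplicated key per queued item
    key = str(uuid.uuid4())
    bdb.putdup(queue, key)
    # The payload itself lives in the same casket under its UUID
    bdb.put(key, payload)
    return key
 
def dequeue(queue):
    # Duplicates come back in insertion order, so the head is item 0
    key = bdb.getlist(queue)[0]
    payload = bdb.get(key)
    bdb.out(queue)   # removes only the first duplicate of the key
    bdb.out(key)
    return payload
 
enqueue('port:output', 'some payload bytes')
print dequeue('port:output')
bdb.close()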

Previously I had been heavily using the Python bindings to test Tokyo Cabinet, but this time I went down the Java route (since the Meandre infrastructure is written in Java). The Java bindings are basically built around JNI and statically link the C Tokyo Cabinet library, giving you the best of both worlds. To measure how fast I can write data out of a port into the local storage in transactional mode, I used the following piece of code.

	import java.io.File;
	import java.util.UUID;
 
	import tokyocabinet.BDB;
 
	// Harness reconstructed around the original main(); the class name and
	// the two constant values are placeholders
	public class TemporaryStorageBenchmark {
 
		static final String TEST_CASKET_TCB = "test_casket.tcb";
		static final String QUEUE_KEY = "queue";
 
		static void fail ( String msg ) {
			throw new RuntimeException(msg);
		}
 
		public static void main ( String args [] ) {
			int MAX = 10000000;
			int inc = 10;
			int cnt = 0;
			float fa [] = new float[8];
			int reps = 10;
 
			// Grow the transaction size by a factor of 10 each round
			for ( int i=1 ; i<=MAX ; i*=inc ) {
				for ( int j=0 ; j<reps ; j++ ) {
 
					// Open the database (create it if missing, sync writes on commit)
					BDB bdb = new BDB();
					if(!bdb.open(TEST_CASKET_TCB, BDB.OWRITER | BDB.OCREAT | BDB.OTSYNC )){
						int ecode = bdb.ecode();
						fail("open error: " + bdb.errmsg(ecode));
					}
 
					// Write i queue entries plus their payloads in a single transaction
					long start = System.currentTimeMillis();
					bdb.tranbegin();
					for ( int k=0; k<i; k++ ) {
						String uuid = UUID.randomUUID().toString();
						bdb.putdup(QUEUE_KEY, uuid);
						bdb.putdup(uuid.getBytes(), uuid.getBytes());
					}
					bdb.trancommit();
					fa[cnt] += System.currentTimeMillis()-start;
 
					// Clean up
					bdb.close();
					new File(TEST_CASKET_TCB).delete();
				}
				// Average over the repetitions; report total time and time per item
				fa[cnt] /= reps;
				System.out.println(""+i+"\t"+fa[cnt]+"\t"+(fa[cnt]/i));
				cnt++;
			}
		}
	}

The idea is very simple. Just start storing 1, 10, 100, 1,000, 10,000, 100,000, 1,000,000, and 10,000,000 pieces of data at once in a single transaction, and measure the time. For each data volume, repeat the operation 10 times and average the times, trying to mitigate the fact that the experiment was run on a laptop running all sorts of other concurrent applications. Plot the results to illustrate:

  1. the time required to insert one piece of data as a function of the number of data involved in the transaction
  2. the number of pieces of data written per second as a function of the number of data involved in the transaction

The idea is to expose the behavior of Tokyo Cabinet as more data is involved in a transaction, to check whether degradation happens as the volume increases. This is an important issue, since data-intensive flows can generate large volumes of data per firing event.

The results

Results are displayed in the figures below.

Time per data unit as a function of the number of data involved in a transaction
Throughput as a function of the number of data in a transaction

The first important element to highlight is that the time to insert one data element does not degrade as the volume increases; in fact, it is quite interesting that Tokyo Cabinet feels more comfortable as the volume per transaction grows. The throughput results are also interesting: they show that it can sustain transfers of around 40K data units per second, and that the only bottleneck is disk cache management and the bandwidth to the disk itself, which gets saturated after pushing more than 10K pieces of data.

The lessons learned

Tokyo Cabinet is an excellent candidate to support the temporary transactional storage required for the distributed execution of a Meandre flow. Other alternatives, like MySQL, embedded Apache Derby, the Java edition of Berkeley DB, and SQLite JDBC, could not even get close to such performance, falling at least one order of magnitude behind.

Related posts:

  1. Easy, reliable, and flexible storage for Python
  2. ZooKeeper and distributed applications
  3. Meandre: Semantic-Driven Data-Intensive Flow Engine

Liquid: RDF endpoint for FluidDB

A while ago I wrote some thoughts about how to map RDF to and from FluidDB. There I explored how you could map RDF onto FluidDB, and how to get it back. That got me thinking about how to get a simple endpoint you could query for RDF. Imagine being able to pull FluidDB data as RDF; then I could get all the flexibility of SPARQL for free. With this idea in mind, I just went and grabbed Meandre and the JFluidDB library started by Ross Jones, and built a few components.

The main goal was to be able to get an object, list its tags, and express the result in RDF. FluidDB helps the mapping, since objects are uniquely identified by URIs. For instance, the unique object 5ff74371-455b-4299-83f9-ba13ae898ad1 (FluidDB relies on version-four UUIDs of the form xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx) is uniquely identified by http://sandbox.fluidinfo.com/objects/5ff74371-455b-4299-83f9-ba13ae898ad1 (or a URL of the form http://sandbox.fluidinfo.com/objects/xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx) in case you are using the sandbox, or http://fluiddb.fluidinfo.com/objects/5ff74371-455b-4299-83f9-ba13ae898ad1 if you are using the main instance. The same story holds for tags: the tag fluiddb/about is uniquely identified by the URI http://sandbox.fluidinfo.com/tags/fluiddb/about, or http://fluiddb.fluidinfo.com/tags/fluiddb/about.
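
Since the whole mapping boils down to assembling these URIs, a couple of trivial helpers capture it. This is just a sketch; the helper names are mine.

SANDBOX = 'http://sandbox.fluidinfo.com'
MAIN = 'http://fluiddb.fluidinfo.com'
 
def object_uri(uuid, instance=SANDBOX):
    # Unique URI for a FluidDB object given its UUID
    return '%s/objects/%s' % (instance, uuid)
 
def tag_uri(path, instance=SANDBOX):
    # Unique URI for a FluidDB tag given its namespaced path
    return '%s/tags/%s' % (instance, path)
 
print object_uri('5ff74371-455b-4299-83f9-ba13ae898ad1')
print tag_uri('fluiddb/about')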

A simple RDF description for an object

Once you get the object back, the basic translated RDF version for object a10ab0f3-ef56-4fc0-a8fa-4d452d8ab1db should look like the listing below in Turtle notation.

<http://sandbox.fluidinfo.com/objects/a10ab0f3-ef56-4fc0-a8fa-4d452d8ab1db>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute fluiddb/default/tags/permission/update/policy"^^<http://www.w3.org/2001/XMLSchema#string> .

I will break the above example into the three main pieces involved (the id, the about, and the tags). The basic construct is simple. First, a triple marks the object as a FluidDB object.

<http://sandbox.fluidinfo.com/objects/a10ab0f3-ef56-4fc0-a8fa-4d452d8ab1db>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/>   
.

Then, if the object had an about associated at creation time, another triple gets generated and added, as shown below. To be consistent, I suggest reusing the DC description property, since that is what the about of an object tends to indicate.

<http://sandbox.fluidinfo.com/objects/a10ab0f3-ef56-4fc0-a8fa-4d452d8ab1db>
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute fluiddb/default/tags/permission/update/policy"^^<http://www.w3.org/2001/XMLSchema#string> 
.

Finally, if there are tags associated with the object, a bag gets created, and the URIs describing the tags get pushed into the bag as shown below.

<http://sandbox.fluidinfo.com/objects/a10ab0f3-ef56-4fc0-a8fa-4d452d8ab1db>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description>
.
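
The actual components run on Meandre and are written in Java, but the same three-step construction can be sketched in a few lines of Python with rdflib. This is an illustration of the mapping above, not the component code; the function name is mine.

from rdflib import Graph, URIRef, Literal
 
RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
XSD_STRING = URIRef('http://www.w3.org/2001/XMLSchema#string')
DC_DESCRIPTION = URIRef('http://purl.org/dc/elements/1.1/description')
FDB_OBJECTS = URIRef('http://sandbox.fluidinfo.com/objects/')
 
def object_to_model(object_uri, about, tag_uris):
    g = Graph()
    o = URIRef(object_uri)
    # 1. Mark the subject as a FluidDB object
    g.add((o, URIRef(RDF_NS + 'type'), FDB_OBJECTS))
    # 2. Attach the about, if any, as a DC description
    if about is not None:
        g.add((o, DC_DESCRIPTION, Literal(about, datatype=XSD_STRING)))
    # 3. Type the object as a bag and push every tag URI into it
    if tag_uris:
        g.add((o, URIRef(RDF_NS + 'type'), URIRef(RDF_NS + 'Bag')))
        for i, tag in enumerate(tag_uris):
            g.add((o, URIRef(RDF_NS + '_%d' % (i + 1)), URIRef(tag)))
    return g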

Creating an RDF endpoint

Armed with the previous mapping, building the endpoint should be easy: allow querying for objects, collect the object information, and finally generate the RDF. Using Meandre and JFluidDB I wrote a few components that allow the simple creation of such an endpoint, as illustrated by the picture below.

Meandre FluidDB RDF endpoint

The basic mechanism is simple. Just push the query into the Query for objects component, which streams the UUID of each matched object to Read object, which in turn pulls the object information. Each object is then passed to Object to RDF model, which generates the RDF snippet shown in the example above. All the RDF fragments are reduced together by the Wrapped models reducer component, the resulting RDF model gets serialized into text using the Turtle notation, and the serialized text is finally printed to the console. The equivalent code could be expressed as the following ZigZag script:

#
# Imports eliminated for clarity
#

#
# Create the component aliases
#
alias <…> as OBJECT_TO_RDF
alias <…> as PRINT_OBJECT
alias <…> as QUERY_FOR_OBJECTS
alias <…> as READS_THE_REQUESTED_OBJECT
alias <…> as WRAPPED_MODELS_REDUCER
alias <…> as MODEL_TO_RDF_TEXT
alias <…> as PUSH_STRING

#
# Create the component instances
#
push_query_string = PUSH_STRING()
wrapped_models_reducer = WRAPPED_MODELS_REDUCER()
query_for_objects = QUERY_FOR_OBJECTS()
reads_object = READS_THE_REQUESTED_OBJECT()
model_to_rdf_text = MODEL_TO_RDF_TEXT()
print_rdf_text = PRINT_OBJECT()
object_to_rdf_model = OBJECT_TO_RDF()

#
# Set component properties
#
push_query_string.message = "has fluiddb/tag/path"
query_for_objects.fluiddb_url = "http://sandbox.fluidinfo.com"
reads_object.fluiddb_url = "http://sandbox.fluidinfo.com"
model_to_rdf_text.rdf_dialect = "TTL"

#
# Create the flow by connecting the components
#
@query_for_objects_outputs = query_for_objects()
@model_to_rdf_text_outputs = model_to_rdf_text()
@push_query_string_outputs = push_query_string()
@object_to_rdf_model_outputs = object_to_rdf_model()
@reads_object_outputs = reads_object()
@wrapped_models_reducer_outputs = wrapped_models_reducer()

query_for_objects(text: push_query_string_outputs.text)
model_to_rdf_text(model: wrapped_models_reducer_outputs.model)
object_to_rdf_model(object: reads_object_outputs.object)
reads_object(uuid: query_for_objects_outputs.uuid)[+200!]
print_rdf_text(object: model_to_rdf_text_outputs.text)
wrapped_models_reducer(model: object_to_rdf_model_outputs.model)

The only interesting element in the script is the [+200!] entry, which creates 200 parallel copies of Read object that concurrently hit FluidDB to pull the data, trying to minimize the latency. The script can be compiled into a MAU file and run. The output of the execution looks like the following:

$ java -jar zzre-1.4.7.jar pull-test.mau 
Meandre MAU Executor [1.0.1vcli/1.4.7]
All rights reserved by DITA, NCSA, UofI (2007-2009)
THIS SOFTWARE IS PROVIDED UNDER University of Illinois/NCSA OPEN SOURCE LICENSE.
 
Executing MAU file pull-test.mau
Creating temp dir pull-test.mau.run
Creating temp dir pull-test.mau.public_resources
 
Preparing flow: meandre://seasr.org/zigzag/1253813636945/4416962494019783033/flow/pull-test-mau/
2009-09-24 12:34:38.480::INFO:  jetty-6.1.x
2009-09-24 12:34:38.495::INFO:  Started SocketConnector@0.0.0.0:1715
Preparation completed correctly
 
Execution started at: 2009-09-24T12:34:38
----------------------------------------------------------------------------
<http://sandbox.fluidinfo.com/objects/a24b4a18-5483-47c6-9b62-0955210c7ebd>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute test/Net::FluidDB-name-1253772095.82845-0.944567286499904"^^<http://www.w3.org/2001/XMLSchema#string> .
 
<http://sandbox.fluidinfo.com/objects/5ff74371-455b-4299-83f9-ba13ae898ad1>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute test/Net::FluidDB-name-1253622685.3231461-0.437099602163897316"^^<http://www.w3.org/2001/XMLSchema#string> .
 
<http://sandbox.fluidinfo.com/objects/67e52346-527e-4bb7-b8f3-05fa8a8ae35b>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute test/Net::FluidDB-name-1253620190.69175-0.861614257420541"^^<http://www.w3.org/2001/XMLSchema#string> .
 
<http://sandbox.fluidinfo.com/objects/8a65a184-03d9-4881-95df-02fa0561a86f>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute fluiddb/namespaces/permission/update/exceptions"^^<http://www.w3.org/2001/XMLSchema#string> .
 
<http://sandbox.fluidinfo.com/objects/335b44e9-a72f-479d-ad60-3661a35231ba>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute test/Net::FluidDB-name-1253776141.95577-0.284175700598524"^^<http://www.w3.org/2001/XMLSchema#string> .
 
<http://sandbox.fluidinfo.com/objects/3bbf1cc6-731c-4e56-a664-adeb5484334f>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute fluiddb/namespaces/permission/delete/policy"^^<http://www.w3.org/2001/XMLSchema#string> .
 
<http://sandbox.fluidinfo.com/objects/aba5adcf-fd44-40ab-b702-9cc635650bc3>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute test/Net::FluidDB-name-1253614713.757-0.604769721717702"^^<http://www.w3.org/2001/XMLSchema#string> .
 
<http://sandbox.fluidinfo.com/objects/f61ceb3b-33df-4356-8e7d-c56d3d0ae338>
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
              <http://sandbox.fluidinfo.com/objects/> , <http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
              <http://sandbox.fluidinfo.com/tags/fluiddb/about> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/path> ;
      <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
              <http://sandbox.fluidinfo.com/tags/fluiddb/tags/description> ;
      <http://purl.org/dc/elements/1.1/description>
              "Object for the attribute test/Net::FluidDB-name-1253615887.80879-0.0437609496034099"^^<http://www.w3.org/2001/XMLSchema#string> .
 
...

That’s it! A first RDF dump of the query!

The not so great news

The current FluidDB API does not provide any method to pull data from more than one object at once. That basically means that for each UUID a separate call to the server needs to be processed, which is a huge latency overhead. The FluidDB guys know about it, and they are scratching their heads on how to provide a “multi get”. A full trace of the output can be found in this FluidDB RDF endpoint trace.

This element is crucial for any RDF endpoint. Above I left out a basic element: the time measurements. That part looks like:

Flow execution statistics

Flow unique execution ID : meandre://seasr.org/zigzag/1253813636945/4416962494019783033/flow/pull-test-mau/8D8E354A/1253813678323/1493255769/
Flow state               : ended
Started at               : Thu Sep 24 12:34:38 CDT 2009
Last update              : Thu Sep 24 12:37:28 CDT 2009
Total run time (ms)      : 170144

Basically, 170 seconds to pull only 238 objects, that is, roughly 715 ms per object, with virtually all the time spent round-tripping to FluidDB.

Getting there

This basically means that such high latency would not allow efficient interactive usage of the endpoint. However, this exercise was useful to prove that simple RDF endpoints for FluidDB are possible and would greatly boost the flexibility of interaction with FluidDB. The current form of the endpoint may still have value if you are not in a hurry, allowing you to run SPARQL queries against FluidDB data and get the best of both worlds.

The code used

If you are interested in running the code, you will need Meandre and the components I put together for the experiment, which you can get from http://github.com/xllora/liquid.

Related posts:

  1. Liquid: RDF meandering in FluidDB
  2. Temporary storage for Meandre’s distributed flow execution
  3. Efficient serialization for Java (and beyond)

From Galapagos to Twitter: Darwin, Natural Selection, and Web 2.0

Yesterday I visited Monmouth College to participate in the Darwinpalooza, which commemorates the 200th anniversary of Charles Darwin’s birth and the 150th anniversary of the publication of On the Origin of Species. After scratching my head about what to present, I came up with quite a mix. You will find the abstract of the talk below, as well as the slides I used.

Abstract: One hundred and fifty years have passed since the publication of Darwin’s world-changing manuscript “On the Origin of Species by Means of Natural Selection”. Darwin’s ideas have proven their power to reach beyond the biology realm, and their ability to define a conceptual framework that allows us to model and understand complex systems. In the mid 1950s and 60s, the efforts of a scattered group of engineers proved the benefits of adopting an evolutionary paradigm to solve complex real-world problems. In the 70s, the emerging presence of computers brought us a new collection of artificial evolution paradigms, among which genetic algorithms rapidly gained widespread adoption. Currently, the Internet has fostered an exponential growth of information and computational resources that is clearly disrupting our perception and forcing us to reevaluate the boundaries between technology and social interaction. Darwin’s ideas can, once again, help us understand such disruptive change. In this talk, I will review the origin of artificial evolution ideas and techniques. I will also show how these techniques are nowadays helping to solve a wide range of applications, from life science problems to Twitter puzzles, and how high performance computing can make Darwin’s ideas a routine tool to help us model and understand complex systems.

Related posts:

  1. Challenging lectures on-line at TED
  2. Dusting my Ph.D. thesis off
  3. Scaling Genetic Algorithms using MapReduce

Reinforcement Learning, Logic and Evolutionary Computation

Drew Mellor is pleased to announce the publication of his new LCS book, Reinforcement Learning, Logic and Evolutionary Computation: A Learning Classifier System Approach to Relational Reinforcement Learning, published by Lambert Academic Publishing (ISBN 978-3-8383-0196-9).

Abstract: Reinforcement learning (RL) consists of methods that automatically adjust behaviour based on numerical rewards and penalties. While use of the attribute-value framework is widespread in RL, it has limited expressive power. Logic languages, such as first-order logic, provide a more expressive framework, and their use in RL has led to the field of relational RL. This thesis develops a system for relational RL based on learning classifier systems (LCS). In brief, the system generates, evolves, and evaluates a population of condition-action rules, which take the form of definite clauses over first-order logic. Adopting the LCS approach allows the resulting system to integrate several desirable qualities: model-free and “tabula rasa” learning; a Markov Decision Process problem model; and, importantly, support for variables as a principal mechanism for generalisation. The utility of variables is demonstrated by the system’s ability to learn genuinely scalable behaviour: behaviour learnt in small environments that translates to arbitrarily large versions of the environment without the need for retraining.

Liquid: RDF meandering in FluidDB

Meandre (the data-intensive computing infrastructure pushed by NCSA) relies on RDF to describe components, flows, locations, and repositories. RDF has become the central piece that makes Meandre’s flexibility and reusability possible. However, one piece still remains largely sketchy and has no clear optimal solution: how can we make it easy for anybody to share, publish, and annotate flows, components, locations, and repositories? More importantly, how can that be done in the cloud in an open-ended fashion, allowing anybody to annotate and comment on each of the aforementioned pieces?

The FluidDB trip

During my last summer trip to Europe, Terry Jones (CEO) invited me to visit FluidInfo (based in Barcelona), where I also met Esteve Fernandez (CTO). I had a great opportunity to chat with the masterminds behind an intriguing concept I ran into after a short note I received from David E. Goldberg. FluidDB, the main product being pushed by FluidInfo, is an online collaborative “cloud” database. In FluidInfo’s words:

FluidDB lets data be social. It allows almost unlimited information personalization by individual users and applications, and also between them. This makes it simple to build a wide variety of applications that benefit from cooperation, and which are open to unanticipated future enhancements. Even more importantly, FluidDB facilitates and encourages the growth of applications that leave users in control of their own data.

FluidDB went live as a private alpha last week. The basic concept behind the scenes is simple. FluidDB stores objects. Objects do not belong to anybody. Objects may be “blank” or they may be about something (e.g. http://seasr.org/meandre). You can create as many blank objects as you want, but creating an object with the same about always returns the same object (thus, there will only ever be one object about http://seasr.org/meandre). Once objects exist, things start getting more interesting: you can tag any object with whatever tag you want. For instance, I could tag the http://seasr.org/meandre object with a hosted_by tag and assign the tag a value. FluidDB introduces one last trick: namespaces. For instance, I got the xllora namespace; that means the tag I mentioned above would look like xllora/hosted_by. You can create as many nested namespaces under your main namespace as you want. FluidDB also provides mechanisms to control who can query and see the values of the tags you create.

As you can see, the basic object model and mechanics are very simple. When the alpha went live, FluidDB only provided access via a simple REST-like HTTP API. Within a few days, a blossoming collection of client libraries wrapping that API was developed by a dynamic community that gathers on the #fluiddb channel at irc.freenode.net.

You were saying something about RDF

Back to the point. One thing I chatted about with the FluidDB guys was what they thought about the similarities between FluidDB’s object model and RDF. After playing with RDF for a while, the FluidDB model looked awfully familiar, albeit a much simplified and more manageable model than RDF. They did not have much to say about it, and the question got stuck in the back of my mind. So when I got access to the private alpha, I could not help but go down the path of what it would mean to map RDF onto FluidDB. Yes, the simple straight answer would be to stick serialized RDF into the value of a given tag (e.g. xllora/rdf). However, that option seemed poor, since it would not exploit the social aspect of collaborative annotation provided by FluidDB. So back to the drawing board. What do both models have in common? They are both descriptions about something. In RDF those are the subjects of the triple predicates, whereas in FluidDB those are simply objects. RDF uses properties to qualify subjects; FluidDB uses tags. Both let you attach values to what they qualify. Mmh, there you go.

With this idea in mind, I started Liquid, a simple proof-of-concept library that maps RDF onto FluidDB and then gets it back. There was only one thing that needed a bit of patching: RDF properties are arbitrary URIs, which cannot easily be mapped on top of FluidDB tags, so I took a simple compromise route.

  • RDF subject URIs are mapped onto FluidDB objects qualified via the about tag
  • One FluidDB tag contains all the properties for that object (basically a simple dictionary encoded in JSON)
  • References to other RDF URIs are mapped onto FluidDB object URIs, and vice versa

Let’s make it a bit more chewable with a simple example.

<?xml version="1.0"?>
 
<rdf:RDF
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:cd="http://www.recshop.fake/cd#">
 
<rdf:Description
rdf:about="http://www.recshop.fake/cd/Empire Burlesque">
  <cd:artist>Bob Dylan</cd:artist>
 </rdf:Description>
 
</rdf:RDF>

The above RDF represents a single triple:

http://www.recshop.fake/cd/Empire Burlesque	http://www.recshop.fake/cd#artist	   "Bob Dylan"

This triple can be mapped onto FluidDB by creating one qualified FluidDB object and adding the proper tags. The example below shows how to do so using Nicholas J. Radcliffe’s fdb.py Python client library.

import fdb,sys
if sys.version_info < (2, 6):
    import simplejson as json
else:
    import json
 
__RDF_TAG__ = 'rdf'
__RDF_TAG_PROPERTIES__  = 'rdf_properties'
__RDF_TAG_MODEL_NAME__ = 'rdf_model_name'
 
#
# Initialize the FluidDB client library
#
f = fdb.FluidDB()
#
# Create the tags (if they exist, this won't hurt)
#
f.create_abstract_tag(__RDF_TAG__)
f.create_abstract_tag(__RDF_TAG_PROPERTIES__)
f.create_abstract_tag(__RDF_TAG_MODEL_NAME__)
#
# Create the subject object of the triple
#	
o = f.create_object('http://www.recshop.fake/cd/Empire Burlesque')
#
# Map RDF properties
#
properties = {'http://www.recshop.fake/cd#artist':['Bob Dylan']}
#
# Tag the object as RDF aware, properties available, and to which model/named graph 
# it belongs
#
f.tag_object_by_id(o.id, __RDF_TAG__)
f.tag_object_by_id(o.id,__RDF_TAG_PROPERTIES__,value=json.dumps(properties))
f.tag_object_by_id(o.id, __RDF_TAG_MODEL_NAME__,'test_dummy')

Running along with this basic idea, I quickly stitched together a simple library (Liquid) that allows ingestion and retrieval of RDF from FluidDB. It is still very rudimentary and may not map all possible RDF properly, but it is a working proof of concept that it is possible to do so.

The Python code above just saves a triple. You can easily retrieve the triple by performing the following operation.

import fdb,sys
if sys.version_info < (2, 6):
    import simplejson as json
else:
    import json
 
__RDF_TAG__ = 'rdf'
__RDF_TAG_PROPERTIES__  = 'rdf_properties'
__RDF_TAG_MODEL_NAME__ = 'rdf_model_name'
 
#
# Initialize the FluidDB client library
#
f = fdb.FluidDB()
#
# Retrieve the annotated objects
#
objs = f.query('has xllora/%s'%(__RDF_TAG__))
#
# Optionally you could retrieve the ones only belonging to a given model by
#
# objs = fdb.query('has xllora/%s and xllora/%s matches "%s"'%(__RDF_TAG__,__RDF_TAG_MODEL_NAME__,modelname))
#
subs = [f.get_tag_value_by_id(s,'/tags/fluiddb/about') for s in objs]
props_tmp = [f.get_tag_value_by_id(s,'/tags/xllora/'+__RDF_TAG_PROPERTIES__) for s in objs]
props = [json.loads(s[1]) if s[0]==200 else {} for s in props_tmp]

Now subs contains all the subject URIs for the predicates, and props all the dictionaries containing the properties.
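
To make the round trip concrete, here is a minimal sketch that flattens those two parallel lists back into triples. It assumes the (status, value) pairs returned by fdb.py have been unwrapped into plain strings, as done for props above.

# Rebuild the (subject, property, value) triples Liquid stored
triples = []
for subject, properties in zip(subs, props):
    for prop, values in properties.items():
        for value in values:
            triples.append((subject, prop, value))
 
for s, p, o in triples:
    print '%s %s "%s"' % (s, p, o)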

The bottom line

OK. So, why is this mapping important? Basically, it allows collaborative tagging of the created objects (subjects), enabling a collaborative and social gathering of information alongside the mapped RDF. So, what does it all mean?

It basically means that, if you do not need to ingest RDF (where property URIs are not directly mapped and you need to Fluidify/reify them), any data stored in FluidDB is already in some form of triplified RDF. Let me explain what I mean by that. Each FluidDB object has a unique URI (e.g. http://fluidDB.fluidinfo.com/objects/4fdf7ff4-f0da-4441-8e63-9b98ed26fc12). Each tag is also uniquely identified by a URI (e.g. http://fluidDB.fluidinfo.com/tags/xllora/rdf_model_name). And finally, each object/tag pair may have a value (e.g. a literal 'test_dummy', or maybe another URI such as http://fluidDB.fluidinfo.com/objects/a0dda173-9ee0-4799-a507-8710045d2b07). If an object/tag pair does not have a value, you can just point it to a no-value URI (or some other convention you like).

Having said that, you now have all the pieces to express FluidDB data as plain shareable RDF: get all the tags for an object, query their values, and then generate an RDF model by adding the gathered triples, as sketched below. That’s easy. Also, if you align your properties to tags, ingestion becomes equally trivial. I will try to get that piece into Liquid as soon as other issues allow me to do so :D .
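
A minimal sketch of that triplification follows; the fluiddb_triple helper and the no-value URI convention are mine.

FDB = 'http://fluiddb.fluidinfo.com'
NO_VALUE = FDB + '/objects/no-value'  # pick whatever convention you like
 
def fluiddb_triple(object_id, tag_path, value=None):
    # Object URI as subject, tag URI as predicate, tag value as object
    subject = '%s/objects/%s' % (FDB, object_id)
    predicate = '%s/tags/%s' % (FDB, tag_path)
    return (subject, predicate, value if value is not None else NO_VALUE)
 
print fluiddb_triple('4fdf7ff4-f0da-4441-8e63-9b98ed26fc12',
                     'xllora/rdf_model_name', 'test_dummy')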

Just to close, I would like to mention once again a key element of this picture. FluidDB opens the door to a truly cooperative, distributed, and online fluid semantic web. It is one of the first examples of how annotations (a.k.a. metadata) can be easily gathered and used in the “cloud” by the masses. Great job, guys!

Related posts:

  1. Liquid: RDF endpoint for FluidDB
  2. Meandre: Semantic-Driven Data-Intensive Flows in the Clouds
  3. Meandre 1.4.0 final release candidate tagged

Easy, reliable, and flexible storage for Python

A while ago I wrote a little post about alternative column stores. One that I mentioned was Tokyo Cabinet (and its associated server, Tokyo Tyrant). Tokyo Cabinet is a key-value store written in C, with bindings for multiple languages (including Python and Java). It can maintain databases in memory or spin them to disk (you can pick between hash-based and B-tree-based stores).

Having heard a bunch of good things, I finally gave it a try. I installed both Cabinet and Tyrant with the usual configure, make, make install cycle (you may find useful installation instructions here). Another nice feature of Tyrant is that it also supports HTTP gets and puts. Having said all this, I just wanted to check how easy it was to use from Python, and the answer was: very. Joseph Turian’s examples got me running in less than 2 minutes when dealing with a particular database; see the piece of code below. Using Tyrant over HTTP is quite simple too, as the sketch after the code shows (see also the PeteSearch blog post).

import pytc,pickle
from numpy import *
 
# Open (or create) a hash-based database on disk
hdb = pytc.HDB()
hdb.open('casket.tch',pytc.HDBOWRITER|pytc.HDBOCREAT)
 
# Store a pickled numpy array and read it back
a = arange(100)
hdb.put('test',pickle.dumps(a))
b = pickle.loads(hdb.get('test'))
if (a==b).all() :
     print 'OK'
hdb.close()
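
And since Tyrant also speaks HTTP, here is a minimal sketch of the same put/get cycle over the wire. It assumes a ttserver running on the default port 1978; the key travels in the URI and the value in the request body.

import httplib
 
conn = httplib.HTTPConnection('localhost', 1978)
 
# Store a value under the key 'test'
conn.request('PUT', '/test', 'hello tyrant')
resp = conn.getresponse()
resp.read()         # drain the (empty) body before reusing the connection
print resp.status   # 201 on success
 
# Read it back with a plain GET
conn.request('GET', '/test')
print conn.getresponse().read()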

Related posts:

  1. Temporary storage for Meandre’s distributed flow execution
  2. Efficient storage for Python
  3. A simple and flexible GA loop in Python

Large Scale Data Mining using Genetics-Based Machine Learning

Below you may find the slides of the GECCO 2009 tutorial that Jaume Bacardit and I put together. Hope you enjoy them.

Slides

Abstract

We are living in the petabyte era. We have larger and larger data to analyze, process, and transform into useful answers for the domain experts. Robust data mining tools, able to cope with petascale volumes and/or high dimensionality while producing human-understandable solutions, are key in several domain areas. Genetics-based machine learning (GBML) techniques are perfect candidates for this task, among others, due to the recent advances in representations, learning paradigms, and theoretical modeling. If evolutionary learning techniques aspire to be a relevant player in this context, they need the capacity to process these vast amounts of data within a reasonable time. Moreover, massive computation cycles are getting cheaper every day, allowing researchers access to unprecedented degrees of parallelization. Several topics are interlaced in these two requirements: (1) having the proper learning paradigms and knowledge representations, (2) understanding them and knowing when they are suitable for the problem at hand, (3) using efficiency enhancement techniques, and (4) transforming and visualizing the produced solutions to give back as much insight as possible to the domain experts, to name a few.

This tutorial will try to answer these questions, following a roadmap that starts with what large means and why large is a challenge for GBML methods. Afterwards, we will discuss the different facets in which we can overcome this challenge: efficiency enhancement techniques, representations able to cope with high-dimensionality spaces, and scalability of learning paradigms. We will also review a topic interlaced with all of them: how we can model the scalability of the components of our GBML systems to better engineer them and get the best performance out of them on large datasets. The roadmap continues with examples of real applications of GBML systems and finishes with an analysis of further directions.

Related posts:

  1. Observer-Invariant Histopathology using Genetics-Based Machine Learning
  2. Deadline extended for special issue on Metaheuristics for Large Scale Data Mining
  3. [BDCSG2008] Algorithmic Perspectives on Large-Scale Social Network Data (Jon Kleinberg)

Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study using Meandre

Below you may find the slides I used during GECCO 2009 to present the paper titled “Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study using Meandre”. An early preprint in the form of a technical report is available as IlliGAL TR No. 2009001, and the full paper at the ACM digital library.

Related posts:

  1. Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study using Meandre
  2. Meandre: Semantic-Driven Data-Intensive Flows in the Clouds
  3. Scaling Genetic Algorithms using MapReduce

NIGEL 2006 Part VI: Bacardit

After coming back from GECCO I just uploaded the last of the NIGEL 2006 talks to LCS & GBML Central. This last talk was by Jaume Bacardit, on GBML for protein structure prediction.

Related posts:

  1. NIGEL 2006 Part V: Bernardó vs. Lanzi
  2. NIGEL 2006 Part IV: Llorà vs. Casillas
  3. NIGEL 2006 Part III: Butz vs. Barry

NIGEL 2006 revisited (Part VI): Bacardit

This is the last of the NIGEL 2006 talks. Enjoy this last one.

Jaume Bacardit

Video
[vimeo clip_id=5065758 width="432" height="320"]

Slides
[slideshare id=1384657&doc=nigel-2006-bacardit-090504154202-phpapp02]