Research – Page 2 – LCS & GBML Central

Meandre 2.0 Alpha Preview = Scala + MongoDB

A lot of water under the bridge has gone by since the first release of Meandre 1.4.X series. In January I went back to the drawing board and start sketching what was going to be 1.5.X series. The slide deck embedded above is a extended list of the thoughts during the process. As usual, I started collecting feedback from people using 1.4.X in production, things that worked, things that needed improvement, things that were just plain over complicated. The hot recurrent topics that people using 1.4.X could be mainly summarized as:

Complex execution concurrency model based on traditional semaphores written in Java (mostly my maintenance nightmare when changes need to be introduced)
Server performance bounded by JENA‘s persistent model implementation
State caching on individual servers to boost performance increases complexity of single-image cluster deployments
Could-deployable infrastructure, but not cloud-friendly infrastructure

As I mentioned, these elements where the main ingredients to target for 1.5.X series. However as the redesign moved forward, the new version represented a radical disruption from 1.4.X series and eventually turned up to become the 2.0 Alpha version described here. The main changes that forced this transition are:

Cloud-friendly infrastructure required rethinking of the core functionalities
Drastic redesign of the back-end state storage
Revisited flow execution engine to support flow execution
Changes on the API that render returned JSON documents incompatible with 1.4.X

Meandre 2.0 (currently already available in the the SVN trunk) has been rewritten from scratch using Scala. That decision was motivated to benefit from the Actor model provided by Scala (modeled after Erlang‘s actors). Such model greatly simplify the mechanics of the infrastructure, but it also powered the basis of Snowfield (the effort to create a scalable distributed flow execution engine for Meandre flows). Also, the Scala language expressiveness has greatly reduced the code based size (2.0 code base is roughly 1/3 of the size of 1.4.X series) greatly simplifying the maintenance activities the infrastructure will require as we move forward.

The second big change that pushed the 2.0 Alpha trigger was the redesign of the back end state storage. 1.4.X series heavily relied on the relational storage for persistent RDF models provided by JENA. For performance reasons, JENA caches the model in memory and mostly assumes ownership of the model. Hence, if you want to provide a single-image Meandre cluster you need to inject into JENA cache coherence mechanics, greatly increasing the complexity. Also, the relational implementation relies on the mapping model into a table and triple into a row (this is a bit of a simplification). That implies that large number of SQL statements need to be generated to update models, heavily taxing the relational storage when changes on user repository data needs to be introduced.

An ideal cloud-friendly Meandre infrastructure should not maintain state (neither voluntarily, neither as result of JENA back end). Thus, a fast and scalable back end storage could allow infrastructure servers to maintain no state and be able to provide the appearance of a single image cluster. After testing different alternatives, their community support, and development roadmap, the only option left was MongoDB. Its setup simplicity for small installations and its ability to easily scale to large installations (including cloud-deployed ones) made MongoDB the candidate to maintain state for Meandre 2.0. This was quite a departure from 1.4.x series, where you had the choice to store state via JENA on an embedded Derby or an external MySQL server.

A final note on the building blocks that made possible 2.0 series. Two other side projects where started to support the development of what will become Meandre 2.0.X series:

Crochet: Crochet targets to help quickly prototype REST APIs relying on the flexibility of the Scala language. The initial ideas for Crochet were inspired after reading Gabriele Renzi post on creating a picoframework with Scala (see http://www.riffraff.info/2009/4/11/step-a-scala-web-picoframework) and the need for quickly prototyping APIs for pilot projects. Crochet also provides mechanisms to hide repetitive tasks involved with default responses and authentication/authorization piggybacking on the mechanics provided by application servers.
Snare: Snare is a coordination layer for distributed applications written in Scala and relies and MongoDB to implement its communication layer. Snare implements a basic heartbeat system and a simple notification mechanism (peer-to-peer and broadcast communication). Snare relies on MongoDB to track heartbeat and notification mailboxes.

My first interview …

IWLCS 2010 – Discussion session on LCS / XCS(F)

I just got an email from Martin Butz about a discussion session being planned for IWLCS 2010 and his request to pass it along. Hope all is well and you are going to attend GECCO this year. Regardless if you…

GAssist and GALE Now Available in Python

Ryan Urbanowicz has released Python versions of GAssits and GALE!!! Yup, so excited to see a new incarnation of GALE doing the rounds. I cannot wait to get my hands on it. Ryan has also done an excellent job porting UCS, XCS, and MCS to Python and making those implementations available via “LCS & GBML central” for […]

More information about Ryan’s implementations can found below

Side note: my original GALE implementation can also be downloaded here.

LCS & GBML Central Gets a New Home

Today I finished migrating the LCS & GBML Central site from its original URL (http://lcs-gbml.ncsa.uiuc.edu) to a more permanent and stable home located at http://gbml.org. The original site is already currently redirecting the trafic to the new site, and it will be doing so for a while to help people transition and update bookmarks and […]

Today I finished migrating the LCS & GBML Central site from its original URL (http://lcs-gbml.ncsa.uiuc.edu) to a more permanent and stable home located at http://gbml.org. The original site is already currently redirecting the trafic to the new site, and it will be doing so for a while to help people transition and update bookmarks and feed readers.

I have introduced a few changes to the functionality of the original site. Functional changes can be mostly summarized by (1) dropping the forums section and (2) closing comments on posts and pages. Both functionalities, rarely used in their current form, have been replaced by a simpler public embedded Wave reachable at http://gbml.org/wave. The goal, provide people in the LCS & GBML community a simpler way to discuss, share, and hang out.

About the feeds being aggregated, I have revised the list and added the feeds now available of the table of contents from

I have also added a few other links to relevant research groups doing work on related areas. Please, leave a comment on this post if you know/have a related site that could be aggregated, or if there are missing links to research groups or useful resources.

ICPR 2010 – Contest: Extended Deadline May, 26

Call for Contest Participation – Classifier domains of competence: The landscape contest (ICPR 2010) Classifier domains of competence: The landscape contest is a research competition aimed at finding out the relation between data complexity and the performance of learners. Comparing your techniques to those of other participants on targeted-complexity problems may contribute to enrich our […]

Call for Contest Participation – Classifier domains of competence: The landscape contest (ICPR 2010)

Classifier domains of competence: The landscape contest is a research competition aimed at finding out the relation between data complexity and the performance of learners. Comparing your techniques to those of other participants on targeted-complexity problems may contribute to enrich our understanding of the behavior of machine learning techniques and open further research lines.

The contest will take place on August 22, during the 20th International Conference on Pattern Recognition (ICPR 2010) at Istanbul, Turkey.

We encourage everyone to participate and share with us your work! For further details about dates and submission, please see http://www.salle.url.edu/ICPR10Contest/.

SCOPE OF THE CONTEST

The landscape contest involves the running and evaluation of classifier systems over synthetic data sets. Over the last two decades, the pattern recognition and machine learning communities have developed many supervised learning techniques. Nevertheless, the competitiveness of such techniques has always been claimed over a small and repetitive set of problems. This contest provides a new and configurable testing framework, reliable enough to test the robustness of each technique and detect its limitations.

INSTRUCTION FOR PARTICIPANTS

Contest participants are allowed to use any type of technique. However, we highly encourage and appreciate the use of novel algorithms.

Participants are required to submit the results by email to the organizers.
Submission e-mail: nmacia@salle.url.edu
Meet the submission deadline: Wednesday May 26, 2010

The contest is divided into two phases: (1) offline test and (2) live test. For the offline test, participants should run their algorithms over two sets of problems, S1 and S2. However, the real competition, the live test, will take place during the conference. Two more collections of problems, S3 and S4, will be presented.

S1: Collection of data sets spread along the complexity space to train the learner. All the instances will be duly labeled.

S2: Collection of data sets spread along the complexity space with no class labeling to test the learner performance.

S3: Collection of data sets with no class labeling, like S2 to be run for a limited period of time.

S4: Collection of data sets with no class labeling covering specific regions of the complexity space to determine the neighborhood dominance.

For the offline test, the results report consists of:

1. Labeling the data sets of the collection S2.

The procedure is the following:

Train the learner using Dn-trn.arff in S1.
Provide the rate of the correctly classified instances over a 10-fold cross validation.
Label the corresponding data set Dn-tst.arff in S2.
Store the n models generated for each data set to perform the live contest on August 22. Be ready to load them on this day.

2. Describing the techniques used.

A brief summary (1~2 pages) of the machine learning technique/s used in the experiments must be submitted. We expect details such as the learning paradigm, configuration parameters, strength and limitations, and computational cost.

IMPORTANT DATES

* May 26, 2010: Deadline for submission of the results and technical report

* May 29, 2010: Notification of participation

* Aug 22, 2010: Release of S3 and S4

* Aug 22, 2010: ICPR 2010 – Interactive Session

CONTACT DETAILS

Dr. Tin Kam Ho – tkh at research.bell-labs.com
Núria Macià – nmacia at salle.url.edu
Prof. Albert Orriols Puig – aorriols at salle.url.edu
Prof. Ester Bernadó Mansilla – esterb at salle.url.edu

ICPR 2010 – Contest: Extended Deadline May, 26

Call for Contest Participation – Classifier domains of competence: The landscape contest (ICPR 2010)

The contest will take place on August 22, during the 20th International Conference on Pattern Recognition (ICPR 2010) at Istanbul, Turkey.

We encourage everyone to participate and share with us your work! For further details about dates and submission, please see http://www.salle.url.edu/ICPR10Contest/.

SCOPE OF THE CONTEST

INSTRUCTION FOR PARTICIPANTS

Contest participants are allowed to use any type of technique. However, we highly encourage and appreciate the use of novel algorithms.

Participants are required to submit the results by email to the organizers.
Submission e-mail: nmacia@salle.url.edu
Meet the submission deadline: Wednesday May 26, 2010

S1: Collection of data sets spread along the complexity space to train the learner. All the instances will be duly labeled.

S2: Collection of data sets spread along the complexity space with no class labeling to test the learner performance.

S3: Collection of data sets with no class labeling, like S2 to be run for a limited period of time.

S4: Collection of data sets with no class labeling covering specific regions of the complexity space to determine the neighborhood dominance.

For the offline test, the results report consists of:

1. Labeling the data sets of the collection S2.

The procedure is the following:

Train the learner using Dn-trn.arff in S1.
Provide the rate of the correctly classified instances over a 10-fold cross validation.
Label the corresponding data set Dn-tst.arff in S2.
Store the n models generated for each data set to perform the live contest on August 22. Be ready to load them on this day.

2. Describing the techniques used.

IMPORTANT DATES

* May 26, 2010: Deadline for submission of the results and technical report

* May 29, 2010: Notification of participation

* Aug 22, 2010: Release of S3 and S4

* Aug 22, 2010: ICPR 2010 – Interactive Session

CONTACT DETAILS

Last call for participation to the Lanscape Contest

The landscape contest is a research competition aimed at finding out the relation between data complexity and the performance of learners. Comparing your techniques to those of other participants may contribute to enrich our understanding of the behavior of machine learning techniques and open further research lines.
The contest will take place […]

The contest will take place on August 22, during the 20th International Conference on Pattern Recognition (ICPR 2010) at Istanbul, Turkey.

We encourage everyone to participate and share with us your work! For further details about dates and submission, please see this document or visit the contest webpage.

the attached PDF document or visit the contest webpage: http://www.salle.url.edu/ICPR10Contest/.

Scaling eCGA Model Building via Data-Intensive Computing

I just uploaded the technical report of the paper we put together for CEC 2010 on how we can scale up eCGA using a MapReduce approach. The paper, besides exploring the Hadoop implementation, it also presents some very compelling results obtained with MongoDB (a document based store able to perform parallel MapReduce tasks via sharding). […]

Abstract:
This paper shows how the extended compact genetic algorithm can be scaled using data-intensive computing techniques such as MapReduce. Two different frameworks (Hadoop and MongoDB) are used to deploy MapReduce implementations of the compact and extended com- pact genetic algorithms. Results show that both are good choices to deal with large-scale problems as they can scale with the number of commodity machines, as opposed to previous ef- forts with other techniques that either required specialized high-performance hardware or shared memory environments.

ICPR 2010 – Contest

Classifier domains of competence: The landscape contest is a research competition aimed at finding out the relation between data complexity and the performance of learners. Comparing your techniques to those of other participants may contribute to enrich our understanding of the behavior of machine learning and open further research lines. Contest participants are allowed to […]

Classifier domains of competence: The landscape contest is a research competition aimed at finding out the relation between data complexity and the performance of learners. Comparing your techniques to those of other participants may contribute to enrich our understanding of the behavior of machine learning and open further research lines. Contest participants are allowed to use any type of technique. However, we highly encourage and appreciate the use of novel algorithms.

The contest will take place on August 22, during the 20th International Conference on Pattern Recognition (ICPR 2010) at Istanbul, Turkey.

We are planning to have a day workshop during the ICPR 2010, so that participants will be able to present and discuss their results.

We encourage everyone to participate and share with us your work! For further details about dates and submission, please visit The landscape contest webpage.