Meandre 2.0 Alpha Preview = Scala + MongoDB

A lot of water under the bridge has gone by since the first release of Meandre 1.4.X series. In January I went back to the drawing board and start sketching what was going to be 1.5.X series. The slide deck



A lot of water under the bridge has gone by since the first release of Meandre 1.4.X series. In January I went back to the drawing board and start sketching what was going to be 1.5.X series. The slide deck embedded above is a extended list of the thoughts during the process. As usual, I started collecting feedback from people using 1.4.X in production, things that worked, things that needed improvement, things that were just plain over complicated. The hot recurrent topics that people using 1.4.X could be mainly summarized as:

  • Complex execution concurrency model based on traditional semaphores written in Java (mostly my maintenance nightmare when changes need to be introduced)
  • Server performance bounded by JENA‘s persistent model implementation
  • State caching on individual servers to boost performance increases complexity of single-image cluster deployments
  • Could-deployable infrastructure, but not cloud-friendly infrastructure

As I mentioned, these elements where the main ingredients to target for 1.5.X series. However as the redesign moved forward, the new version represented a radical disruption from 1.4.X series and eventually turned up to become the 2.0 Alpha version described here. The main changes that forced this transition are:

  • Cloud-friendly infrastructure required rethinking of the core functionalities
  • Drastic redesign of the back-end state storage
  • Revisited flow execution engine to support flow execution
  • Changes on the API that render returned JSON documents incompatible with 1.4.X

Meandre 2.0 (currently already available in the the SVN trunk) has been rewritten from scratch using Scala. That decision was motivated to benefit from the Actor model provided by Scala (modeled after Erlang‘s actors). Such model greatly simplify the mechanics of the infrastructure, but it also powered the basis of Snowfield (the effort to create a scalable distributed flow execution engine for Meandre flows). Also, the Scala language expressiveness has greatly reduced the code based size (2.0 code base is roughly 1/3 of the size of 1.4.X series) greatly simplifying the maintenance activities the infrastructure will require as we move forward.

The second big change that pushed the 2.0 Alpha trigger was the redesign of the back end state storage. 1.4.X series heavily relied on the relational storage for persistent RDF models provided by JENA. For performance reasons, JENA caches the model in memory and mostly assumes ownership of the model. Hence, if you want to provide a single-image Meandre cluster you need to inject into JENA cache coherence mechanics, greatly increasing the complexity. Also, the relational implementation relies on the mapping model into a table and triple into a row (this is a bit of a simplification). That implies that large number of SQL statements need to be generated to update models, heavily taxing the relational storage when changes on user repository data needs to be introduced.

An ideal cloud-friendly Meandre infrastructure should not maintain state (neither voluntarily, neither as result of JENA back end). Thus, a fast and scalable back end storage could allow infrastructure servers to maintain no state and be able to provide the appearance of a single image cluster. After testing different alternatives, their community support, and development roadmap, the only option left was MongoDB. Its setup simplicity for small installations and its ability to easily scale to large installations (including cloud-deployed ones) made MongoDB the candidate to maintain state for Meandre 2.0. This was quite a departure from 1.4.x series, where you had the choice to store state via JENA on an embedded Derby or an external MySQL server.

A final note on the building blocks that made possible 2.0 series. Two other side projects where started to support the development of what will become Meandre 2.0.X series:

  1. Crochet: Crochet targets to help quickly prototype REST APIs relying on the flexibility of the Scala language. The initial ideas for Crochet were inspired after reading Gabriele Renzi post on creating a picoframework with Scala (see http://www.riffraff.info/2009/4/11/step-a-scala-web-picoframework) and the need for quickly prototyping APIs for pilot projects. Crochet also provides mechanisms to hide repetitive tasks involved with default responses and authentication/authorization piggybacking on the mechanics provided by application servers.
  2. SnareSnare is a coordination layer for distributed applications written in Scala and relies and MongoDB to implement its communication layer. Snare implements a basic heartbeat system and a simple notification mechanism (peer-to-peer and broadcast communication). Snare relies on MongoDB to track heartbeat and notification mailboxes.

LCS and Software Development

“On the Road to Competence” is a slide deck by Jurgen Appelo with interesting analogies between learning classifier systems and software development. Definitely worth taking a look at it. Related posts:NIGEL 2006 Part II: Dasgupta vs. Booker Large Scale Data Mining using Genetics-Based Machine Learning Software for fast rule matching using vector instructions

Related posts:

  1. NIGEL 2006 Part II: Dasgupta vs. Booker
  2. Large Scale Data Mining using Genetics-Based Machine Learning
  3. Software for fast rule matching using vector instructions

“On the Road to Competence” is a slide deck by Jurgen Appelo with interesting analogies between learning classifier systems and software development. Definitely worth taking a look at it.

Related posts:

  1. NIGEL 2006 Part II: Dasgupta vs. Booker
  2. Large Scale Data Mining using Genetics-Based Machine Learning
  3. Software for fast rule matching using vector instructions

From Galapagos to Twitter: Darwin, Natural Selection, and Web 2.0

Yesterday I was visiting Monmouth College to participate on the Darwinpalooza which commemorates the 200th anniversary of Charles Darwin’s birth and the 150th anniversary of the publication of On the Origin of Species. After scratching my head about about what to present, I came out with quite a mix. You will find the abstract of […]

Related posts:

  1. Challenging lectures on-line at TED
  2. Dusting my Ph.D. thesis off
  3. Scaling Genetic Algorithms using MapReduce

Yesterday I was visiting Monmouth College to participate on the Darwinpalooza which commemorates the 200th anniversary of Charles Darwin’s birth and the 150th anniversary of the publication of On the Origin of Species. After scratching my head about about what to present, I came out with quite a mix. You will find the abstract of the talk below, as well as the slides I used.

Abstract: One hundred and fifty years have passed since the publication of Darwin’s world-changing manuscript “The Origins of Species by Means of Natural Selection”. Darwin’s ideas have proven their power to reach beyond the biology realm, and their ability to define a conceptual framework which allows us to model and understand complex systems. In the mid 1950s and 60s the efforts of a scattered group of engineers proved the benefits of adopting an evolutionary paradigm to solve complex real-world problems. In the 70s, the emerging presence of computers brought us a new collection of artificial evolution paradigms, among which genetic algorithms rapidly gained widespread adoption. Currently, the Internet has propitiated an exponential growth of information and computational resources that are clearly disrupting our perception and forcing us to reevaluate the boundaries between technology and social interaction. Darwin’s ideas can, once again, help us understand such disruptive change. In this talk, I will review the origin of artificial evolution ideas and techniques. I will also show how these techniques are, nowadays, helping to solve a wide range of applications, from life science problems to twitter puzzles, and how high performance computing can make Darwin ideas a routinary tool to help us model and understand complex systems.

Related posts:

  1. Challenging lectures on-line at TED
  2. Dusting my Ph.D. thesis off
  3. Scaling Genetic Algorithms using MapReduce