Deadline extended for special issue on Metaheuristics for Large Scale Data Mining

The deadline for the special issue on Metaheuristics for Large Scale Data Mining to be published by Springer’s Memetic Computing Journal has been extended till May 31, 2009. More information can be found in this post at LCS & GBML Central.


Related posts:

  1. [BDCSG2008] Algorithmic Perspectives on Large-Scale Social Network Data (Jon Kleinberg)
  2. Special issue on chance discovery (I)
  3. GECCO 2009 paper submission deadline extended till January 28


LCSweb + GBML blog = LCS & GBML Central


LCSweb was designed to give researchers, and those seeking to use Learning Classifier Systems in applications, access to material on LCS and to discussion among members of the LCS community. The site served this community since it was started by Alwyn Barry in 1997. Later enhanced and maintained by Jan Drugowitsch, LCSweb became a valuable community resource. The site was completely community-driven, allowing members to contribute content and keep it up to date. Later, in 2005, I started the “LCS and other GBML” blog to cover a gap, providing information about the International Workshop on Learning Classifier Systems (IWLCS), the collection of LCS books available, and GBML-related news.

Some of you may have noticed that, after Jan’s move to Rochester and Alwyn’s retirement from research activities, LCSweb has vanished. Will Browne took it upon himself to move LCSweb to Reading, but technical circumstances made that move rocky despite his best efforts. Jan and Will, however, still have a local copy of the LCSweb contents. After talking to Jan and Will, I proposed merging LCSweb with the LCS and other GBML blog and hosting the new site at NCSA, where dedicated resources have been made available. Jan and Will agreed with the idea.

We are happy to announce that the merged site (still being updated) can be reached at http://lcs-gbml.ncsa.uiuc.edu. More information about the process can be found here or at the LCS & GBML Central site.

Related posts:

  1. New blog for LCS and other GBML
  2. LCS and other GBML warming up for GECCO 2006
  3. LCSWeb creates a LCS and GBML paper database

Wolfram|Alpha is going live in two months


Some time ago I heard about Wolfram|Alpha, a project by Stephen Wolfram, the creator of Mathematica and the author of A New Kind of Science (NKS). This project aims to go beyond the typical approach of search engines by proposing a system that computes the answers to user questions. That is, instead of going to the data and retrieving information by its syntactic similarity to the user’s question, the new architecture tries to figure out the answer, which may not be explicitly written in any web document, by processing the data. For this purpose, Wolfram proposes to use Mathematica and NKS to explicitly implement methods and models as algorithms, and to explicitly curate all data so that it is immediately computable. In addition, human experts are needed to formalize each domain.

In short, it is a new approach, very different from that of natural language processing, that promises to make knowledge computable. Fortunately, I will only need to wait two months to answer all the questions that arose after reading the Wolfram blog.

Goldberg’s and Ollé’s interview on BTV “The engineer of the future?”


Barcelona TV showed a video in which David E. Goldberg is interviewed about the problems of the current engineering education system. Goldberg emphasizes that the role of engineers has shifted nowadays from category enhancers to category creators. He also highlights the importance of teaching the human dimension of the history of the technology we use, and of presenting the heroes who created the objects that have become indispensable in our lives. In summary, engineering schools, and specifically we as teachers, need to spread the joy of engineering and never forget that engineers are people who can build applications that improve people’s lives.

These ideas are complemented by a discussion in which Ramon Ollé and Josep Amat participate. The discussion produced many valuable arguments that may explain the decreasing number of students enrolling in engineering schools, as well as some ideas about how this trend could be reversed. Arguments in favor of both introducing more business concepts and introducing more technical concepts appear in the discussion.

To wrap up: a video really worth watching, which I think makes several key points about what engineering is and what the engineering of the future should be. The only drawback is the language: except for Goldberg’s interview, the rest of the video is only in Catalan.


Efficient serialization for Java (and beyond)



I am currently working on the distributed execution of flows as part of the Meandre infrastructure—itself part of the SEASR project. One of the pieces to explore is how to push data between machines. No, I am not going to talk about network protocols and the like here, but about how you can pass the data around. If you have ever programmed MPI using C/C++, you remember the tedious effort required to pass complex data structures around between processes. Serialization is a way to turn those complex structures into a form that can be easily stored or transmitted, and then retrieved or received to regenerate the original complex data structure. Some languages/platforms support this functionality natively (e.g., Java, Python), making it easy to use the serialized representation for persistence or transmission purposes.

Last Thursday I was talking to Abhishek Verma, and he pointed out Google’s Protocol Buffers project—Google’s take on data interchange formats. It is not a new idea—for instance, CORBA’s IDL has been around for a long time—but what caught my eye were their claims about (1) efficiency and (2) multiple language bindings. I was contemplating using XStream for Meandre’s distributed flow execution needs, but the weight of its XML made me quite reluctant to walk down that path. The native Java serialization is not a bad choice in terms of efficiency, but it provides neither friendly mechanics for modifying data formats without rendering already serialized objects useless, nor a transparent mechanism to allow bindings for other languages/platforms. So Google’s Protocol Buffers seemed an option worth trying. There I went, and I prepared a simple comparison between the three: (1) Java native serialization, (2) Google’s Protocol Buffers, and (3) XStream. Yes, you may guess the outcome, but I was more interested in getting my hands dirty, seeing how Protocol Buffers performs, and finding out how much overhead it requires from the developer.

The experiment

Before getting into the description: this experiment does not try to be an exhaustive performance evaluation, just an afternoon diversion. Having said that, the experiment measured the serialization/deserialization time and the space used for a simple data structure containing just one array of integers and one array of strings. All the integers were initialized to zero, and the strings to “Dummy text”. To measure how serialization time scales, the number of integers and strings was increased incrementally. The code below shows the target object used for the measurements.

package org.meandre.tools.serialization.xstream;

// Simple payload used throughout the experiment: one array of strings and one
// array of ints (the ints are left at their default value of zero).
public class TargetObject {

    public String[] sa;
    public int[] ia;

    public TargetObject(int iStringElements, int iIntegerElements) {
        sa = new String[iStringElements];
        for (int i = 0; i < iStringElements; i++)
            sa[i] = "Dummy text";
        ia = new int[iIntegerElements];
    }
}
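
The JUnit test further below serializes a TargetObjectSerializable, a class the post does not list. A minimal sketch, assuming it simply mirrors TargetObject and adds the java.io.Serializable marker interface, could look like this:

package org.meandre.tools.serialization.java;

import java.io.Serializable;

// Hypothetical sketch (not shown in the original post): the Serializable
// counterpart of TargetObject used by the Java native serialization test.
public class TargetObjectSerializable implements Serializable {

    private static final long serialVersionUID = 1L;

    public String[] sa;
    public int[] ia;

    public TargetObjectSerializable(int iStringElements, int iIntegerElements) {
        sa = new String[iStringElements];
        for (int i = 0; i < iStringElements; i++)
            sa[i] = "Dummy text";
        ia = new int[iIntegerElements]; // ints default to zero
    }
}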

The experiment consisted of generating objects like the one above, containing from 100 to 10,000 elements in increments of 100. Each object was serialized 50 times, measuring the average serialization time and the space required (in bytes) per generated object. Below is the code I used to measure the native Java serialization/deserialization times.

package org.meandre.tools.serialization.java;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

import org.junit.Test;

public class JavaSerializationTest {

    @Test
    public void testJavaSerialization()
    throws IOException {
        final int MAX_SIZE = 10000;
        final int REP = 50;
        final int INC = 100;

        System.out.println("Java serialization times");
        for (int i = INC; i <= MAX_SIZE; i += INC) {
            TargetObjectSerializable tos = new TargetObjectSerializable(i, i);
            long lAccTime = 0;
            long lSize = 0;
            long lTmp;
            ByteArrayOutputStream baos;
            ObjectOutputStream out;
            for (int j = 0; j < REP; j++) {
                baos = new ByteArrayOutputStream();
                out = new ObjectOutputStream(baos);
                // lTmp ends up as (start - end), i.e. minus the elapsed time,
                // so subtracting it below accumulates the elapsed time
                lTmp = System.currentTimeMillis();
                out.writeObject(tos);
                lTmp -= System.currentTimeMillis();
                out.close();
                lAccTime -= lTmp;
                lSize = baos.size();
            }
            System.out.println("" + i + "\t" + (((double) lAccTime) / REP) + "\t" + lSize);
        }
    }

    @Test
    public void testJavaDeserialization()
    throws IOException, ClassNotFoundException {
        final int MAX_SIZE = 10000;
        final int REP = 50;
        final int INC = 100;

        System.out.println("Java deserialization times");
        for (int i = INC; i <= MAX_SIZE; i += INC) {
            // Serialize once, then repeatedly measure how long reading it back takes
            TargetObjectSerializable tos = new TargetObjectSerializable(i, i);
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(baos);
            out.writeObject(tos);
            out.close();
            ByteArrayInputStream bais;
            ObjectInputStream ois;
            long lAccTime = 0;
            long lTmp;
            for (int j = 0; j < REP; j++) {
                bais = new ByteArrayInputStream(baos.toByteArray());
                ois = new ObjectInputStream(bais);
                lTmp = System.currentTimeMillis();
                ois.readObject();
                lTmp -= System.currentTimeMillis();
                lAccTime -= lTmp;
            }
            System.out.println("" + i + "\t" + (((double) lAccTime) / REP));
        }
    }
}

Equivalent versions of the code shown above were used to measure Google’s Protocol Buffers and XStream (sketches of both are included further below). If you are interested in seeing the full code, you can download it as is—no guarantees provided. Also, to complete the experiment code, below is the proto file used for testing the Java implementation of Google’s Protocol Buffers.

package test;
 
option java_package = "org.meandre.tools.serialization.proto";
option java_outer_classname = "TargetObjectProtoOuter";
 
message TargetObjectProto {
  repeated int32 ia = 1;
  repeated string sa = 2;
}
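
For orientation, here is a minimal sketch of how the classes generated from this proto file could be exercised. It is not part of the original experiment code; it just illustrates the builder-style API of the generated Java bindings, with class and package names following the options declared above:

package org.meandre.tools.serialization.proto;

import org.meandre.tools.serialization.proto.TargetObjectProtoOuter.TargetObjectProto;

// Hypothetical sketch: build a TargetObjectProto message, serialize it to the
// compact binary wire format, and parse it back.
public class ProtoUsageSketch {

    public static byte[] buildAndSerialize(int iElements) {
        TargetObjectProto.Builder builder = TargetObjectProto.newBuilder();
        for (int i = 0; i < iElements; i++) {
            builder.addIa(0);            // repeated int32 ia = 1
            builder.addSa("Dummy text"); // repeated string sa = 2
        }
        return builder.build().toByteArray();
    }

    public static TargetObjectProto parseBack(byte[] bytes) throws Exception {
        return TargetObjectProto.parseFrom(bytes);
    }
}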

In order to run the experiment, besides the Protocol Buffers and XStream libraries, you will also need JUnit.
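
Since the XStream variant is not listed above either, here is a minimal sketch of what it might look like, assuming XStream’s default toXML/fromXML API; the timing harness would mirror the JUnit test shown earlier:

package org.meandre.tools.serialization.xstream;

import com.thoughtworks.xstream.XStream;

// Hypothetical sketch: XStream serializes an arbitrary object graph to XML and
// back, without requiring the class to implement Serializable.
public class XStreamUsageSketch {

    public static String serialize(TargetObject to) {
        return new XStream().toXML(to);
    }

    public static TargetObject deserialize(String xml) {
        return (TargetObject) new XStream().fromXML(xml);
    }
}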

The results

The experiments were run on a first-generation MacBook Pro with 2 GB of RAM, using Apple’s Java 1.5 virtual machine. The figures below illustrate the different space requirements of the three serialization methods compared. Figure generation and data processing were done using R.

(Figures: data size of the serialized object; serialized/original data size ratio.)

The figures show the already intuited bloat of the XML-based XStream serialization, up to six times larger than the original data being serialized. On the other hand, the Java native serialization adds only a minimal increase over the original data. Google’s Protocol Buffers requires slightly more space than the native Java serialization, but never doubles the original size. Moreover, it does not exhibit the constant initial payload overhead displayed by both XStream and the Java native serialization. The next question was how costly the serialization process is. The figures below show the amount of time required to serialize an object.

(Figures: serialization time; serialization time ratio.)

The Java native serialization was, as expected, the fastest; however, Google’s Protocol Buffers took, on average, only four times longer than the Java native version. That is peanuts compared to the roughly fifty-times-slower XStream version. Deserialization times of the encoded objects show the same trends as serialization, as the figures below illustrate.

(Figures: deserialization time; deserialization time ratio.)

It is also interesting to note that, as the figure below shows, serialization is faster than deserialization (as common sense would have suggested), and that Google’s Protocol Buffers is the method where this difference is most pronounced.

(Figure: serialization/deserialization time ratio.)

The lessons learned

As I said, this is far from being an exhaustive or even representative evaluation, just one afternoon’s exploration. However, the results show interesting trends. Yes, XStream could be tweaked to make the serialized XML leaner, and—with the proper tinkering—it would even make it possible to deserialize the object on a different platform or in a different language, but at an enormous cost in both size and time. The Java native serialization is by far the fastest and the most size-efficient, but it is made by and for Java. Also, changes to the serialized classes—imagine wanting to add or remove a field—may render already serialized objects unreadable. Google Protocol Buffers, on the other hand, delivers the best of both scenarios: (1) the ability to serialize/deserialize objects in a compact and relatively fast manner, and (2) the ability to serialize/deserialize across different languages and platforms. For these reasons, it seems a very interesting option to keep exploring if you need both.
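
To illustrate the format-evolution point with Protocol Buffers, adding a new field to the message definition is backward compatible: old serialized messages simply leave the field unset, and old readers ignore it. A sketch, where the label field is hypothetical and not part of the experiment:

message TargetObjectProto {
  repeated int32 ia = 1;
  repeated string sa = 2;
  optional string label = 3; // hypothetical new field; existing data stays readable
}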

NCSA/IlliGAL Gathering on Evolutionary Learning (NIGEL’2006)

On May 16th and 17th, a group of more than twenty researchers got together in Urbana-Champaign (Illinois) to participate in the gathering on evolutionary learning organized by the National Center for Supercomputing Applications and the Illinois Genetic Algorithms Laboratory (NIGEL 2006). The goals were to discuss current state-of-the-art research in learning classifier systems and other genetics-based machine learning, and to identify future research trends and applications where evolutionary learning might provide a competitive advantage. On the first day, attendees gave presentations about challenges and current research topics (see the materials below). On the second day, a series of three topic-oriented brainstorming sessions was conducted, covering (1) the future of LCS and other GBML, (2) areas of application, and (3) techniques.

The list of participants included Loretta Auvil, Jaume Bacardit, Alwyn Barry, Lashon Booker, Ester Bernado, Will Browne, Martin Butz, Jorge Casillas, Helen Dam, Dipankar Dasgupta, Deon Garrett, David Goldberg, Noriko Imafuji, Pier Luca Lanzi, Xavier Llora, Kumara Sastry, Kamran Shafi, Kenneth Turvey, Michael Welge, Ashley Williams, Stewart Wilson, and Paul Winward.

Presentation slides and videos

Some pictures of the event can be found here or at the NIGEL web site.

Xavier Llorà: “Welcome and presentation” [Slides][Video]
Stewart W. Wilson: “Can We Do Captchas?” [Slides][Video]
David E. Goldberg: “Searle, Intentionality, and the Future of Classifier Systems” [Slides][Video]
Dipankar Dasgupta: “Artificial Immune Systems in Anomaly Detection” [Slides][Video]
Lashon Booker: “A Retrospective Look at Classifier System Research” [Slides][Video]
Martin Butz: “XCS: Current Capabilities and Future Challenges” [Slides][Video]
Alwyn Barry: “Towards a Formal Framework for Accuracy-based LCS” [Slides][Video]
Xavier Llorà: “Linkage Learning for Pittsburgh Learning Classifier Systems: Making Problems Tractable” [Slides][Video]
Jorge Casillas: “Scalability in GBML, Accuracy-Based Michigan Fuzzy LCS, and New Trends” [Slides][Video]
Ester Bernadó: “Learning Classifier Systems for Unbalanced Datasets” [Slides][Video]
Pier-Luca Lanzi: “Computed Prediction: so far, so good. Now what?” [Slides][Video]
Jaume Bacardit: “Pittsburgh Learning Classifier Systems for Protein Structure Prediction: Scalability and Explanatory Power” [Slides][Video]
