Yet Another cGA Implementation, Now in Haskell.

A year ago, I mentioned that I always write a cGA implementation when I learn a new language. Back then, I was trying to get back to being fluent in Haskell. A couple of days ago, Martin Pelikan did just the same and wanted to compare implementations. So, what did I do? I looked for my implementation to post it here.

I took a look at the code and changed a couple of things, but I can say that this Haskell implementation is the shortest working cGA implementation I have ever written in any language. It is shorter than the versions I wrote in Scala and Erlang. Python could get awkwardly compressed into some functional flavor to get close to this, but dynamic typing… C, C++, Java, Go, and other friends are far away when you look in the rear-view Haskell mirror. Anyway, the code below implements cGA for binary strings. You choose the population size, the number of bits, and the evaluation function. Also, the constructs are simple and elegant enough that they do not need much explanation (not to mention maintainability…)
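For reference, the per-gene rule the code implements is tiny: wherever the two sampled individuals disagree, move that model probability by 1/popSize toward the winner's bit, and leave it untouched otherwise. Restated as a stand-alone Haskell one-liner (updateGene is just an illustrative name, not part of the listing that follows):

-- Move a single model probability p by 1/popSize toward the winner's bit.
updateGene popSize winnerBit loserBit p
	| winnerBit == loserBit = p
	| otherwise             = p + (2 * winnerBit - 1) / popSize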

import Data.List.Split
import System.Random

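-- Per-gene model update: 0 where the two individuals agree; otherwise +1/popSize
-- or -1/popSize, nudging the model toward the first individual's gene.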
diffBinaryIndividuals popSize ind1 ind2 =
	map (\ (x, y) -> if x == y then 0 else (2 * x - 1) / popSize) $ zip ind1 ind2

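-- Evaluate both individuals and shift the model toward the better of the two.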
updateBinaryModel f popSize model ind1 ind2 = 
	zipWith (+) model update
	where f1 = f ind1
	      f2 = f ind2
	      update = if f1 > f2 
	      	then diffBinaryIndividuals popSize ind1 ind2 
	      	else diffBinaryIndividuals popSize ind2 ind1 

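-- Sample two binary individuals by flipping one biased coin per model position.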
sampleTwoBinaryIndividuals model gen =
	chunksOf l $ zipWith (\ m r -> if r < m then 1 else 0) (model ++ model) rnds
	where rnds = take (2 * l) (randoms gen :: [Float])
	      l = length model

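-- One cGA step: sample two competing individuals and update the model.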
cgaStepForBinaryIndividuals f model popSize gen =
	updateBinaryModel f popSize model ind1 ind2
	where ind1 : ind2 : [] = sampleTwoBinaryIndividuals model gen

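-- The model has converged once every probability is above 0.9 or below 0.1.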
hasModelConverged model = all (\x -> x > 0.9 || x < 0.1) model

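-- Main loop: keep applying cGA steps with fresh random generators until the model converges.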
cga _ _ model | hasModelConverged model = return model
cga f popSize model = do
	gen <- newStdGen
	res <- (cga f popSize (cgaStepForBinaryIndividuals f model popSize gen))
	return res

And you can see it in action below solving 5-bit and 50-bit OneMax problems.

> cga (sum) 1000 (take 5 $ repeat 0.5)
[0.90099484,0.9029948,0.9029948,0.9019948,0.9209946]

> cga (sum) 1000 (take 50 $ repeat 0.5)
[0.9209946,0.9279945,0.96899396,0.96899396,0.95399415,0.9259945,0.9419943,0.96299404,0.9589941,0.9419943,0.93799436,0.9519942,0.9109947,0.94599426,0.95399415,0.9449943,0.94799423,0.964994,0.9199946,0.93199444,0.9429943,0.9569941,0.95499414,0.96999395,0.9369944,0.9579941,0.96199405,0.9429943,0.96099406,0.9359944,0.967994,0.9209946,0.9449943,0.966994,0.9329944,0.95499414,0.96999395,0.9449943,0.90799475,0.9579941,0.95299417,0.93999434,0.94699425,0.9179946,0.9559941,0.90099484,0.9359944,0.9339944,0.9339944,0.9359944]
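Since the evaluation function is just an argument, swapping it out requires no changes to the code above. For instance (a hypothetical call, output not shown), the same cga can be asked to drive every bit to zero by negating OneMax:

> cga (negate . sum) 1000 (take 5 $ repeat 0.5)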

cGA, Parallelism, Processes, and Erlang

Back in Fall 2006 I was lucky to be at the right place, at the right time. Kumara Sastry and David E. Goldberg were working to pulverize some preconceptions about how far you could scale genetic algorithms. As I said, I was lucky I could help the best I could. It turned out that the answer was pretty simple: as far as you want. The key to that result was, again, built on Georges Harik’s compact genetic algorithm. The results were published in a paper titled Toward routine billion-variable optimization using genetic algorithms, if you are curious.

Anyway, back on track. A few days ago, I was playing with Erlang and coded, just for fun, yet another cGA implementation, now in Erlang. The code was pretty straightforward, so why not take another crack at it and write an Erlang version that uses some of the ideas from that paper?

The idea we used in the paper was simple: slice the probabilistic model into smaller segments and update all those model fragments in parallel. The only caveat, if you go over the cGA model, is that you need the evaluation of two individuals to decide which way to update the model. Also, you need to know when to stop, that is, when the global model has converged. The flow is pretty simple:

  1. Sample in parallel two individuals.
  2. Compute the partial evaluation (in the example below the beloved OneMax).
  3. Emit the partial evaluations.
  4. Collect the partial evaluations and compute the final fitness.
  5. Rebroadcast the final evaluation to all model fragments.
  6. With the final evaluations at hand, just update the model fragments.
  7. Check whether the local fragment of the model has converged and emit the outcome.
  8. With all the partial convergence checks, decide whether the global model has converged.
  9. If the global model has not converged, continue to (1).

The implementation below is quite rough. It could be cleaned up by using functional interfaces to hide all the message passing between processes, but you get the picture. Also, if you look at the implementation, you may notice that the global fitness and the global convergence check are each computed by a single process that serializes those requests. You may remember Amdahl’s law; this is not a big problem with a few thousand model fragments, but as you scale up you will eventually have to worry about it. You could improve it, for instance, by using a broadcast tree. Anyway, let’s put all that aside for now and write a simple implementation to get the ball rolling.
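Just to make the broadcast-tree idea concrete before setting it aside, here is a minimal Haskell sketch (just an illustration, not part of the Erlang code below) of combining partial values pairwise in O(log n) rounds instead of funneling everything through a single accumulator process:

-- Combine partial values pairwise, level by level, so reducing n fragments
-- takes O(log n) rounds instead of one serialized pass (non-empty list assumed).
treeReduce :: (a -> a -> a) -> [a] -> a
treeReduce _ [x] = x
treeReduce f xs  = treeReduce f (pairUp xs)
  where pairUp (a:b:rest) = f a b : pairUp rest
        pairUp rest       = rest   -- odd leftover moves up to the next round

-- For example, treeReduce (+) [3, 1, 4, 1, 5] == 14.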

-module(pcga).
-export([one_max/1, cga/6, accumulator/4, has_converged/3, cga_loop/8]).
 
 % Accumulates the partial evaluations.
accumulator(Pids, Values1, Values2, Groups) when length(Pids) == Groups ->
  Acc1 = lists:sum(Values1),
  Acc2 = lists:sum(Values2),
  lists:map(fun(P) -> P ! {final_eval, Acc1, Acc2} end, Pids),
  accumulator([], [], [], Groups);
accumulator(Pids, Values1, Values2, Groups) when length(Pids) < Groups ->
  receive
    {eval, Pid,	Value1, Value2} ->
        accumulator([Pid | Pids], [Value1 | Values1], [Value2 | Values2], Groups);
    stop -> ok
  end.

% Convergence checker.
has_converged(Pids, Votes, Groups) when length(Pids) == Groups ->
  FinalVote = lists:sum(Votes),
  lists:map(fun(P) -> P ! {final_converged, FinalVote == Groups} end, Pids),
  has_converged([], [], Groups);
has_converged(Pids, Votes, Groups) when length(Pids) < Groups ->
  receive
    {converged, Pid, Vote} ->
      has_converged([Pid | Pids], [Vote | Votes], Groups);
    stop -> ok
  end.

% OneMax function.
one_max(String) -> lists:sum(String).
 
% Generates random strings of length N given a Model.
random_string(Model) ->
  lists:map(fun (P) -> case random:uniform() < P of true -> 1; _ -> 0 end end,
            Model).
 
% Generates a random population of size Size and strings of length N.
initial_model(N) -> repeat(N, 0.5, []).
 
% Given a pair of evaluated strings, returns the update values.
update({_, Fit}, {_, Fit}, N, _) ->
  repeat(N, 0, []);
update({Str1, Fit1}, {Str2, Fit2}, _, Size) ->
  lists:map(fun ({Gene, Gene}) -> 0;
                ({Gene1, _}) when Fit1 > Fit2 -> ((Gene1 * 2) - 1) / Size;
                ({_, Gene2}) when Fit1 < Fit2 -> ((Gene2 * 2) - 1) / Size
            end,
            lists:zip(Str1, Str2)).

% Check if the model has converged.
converged(Model, Tolerance) ->
  lists:all(fun (P) -> (P < Tolerance) or (P > 1 - Tolerance) end, Model).

% The main cGA loop.
cga(N, GroupSize, Groups, Fun, Tolerance, Print) 
  when N > 0, GroupSize > 0, Groups > 0, Tolerance > 0, Tolerance < 0.5 ->
  Acc = spawn(pcga, accumulator, [[], [], [], Groups]),
  Con = spawn(pcga, has_converged, [[], [], Groups]),
  lists:foreach(
    fun(_) ->
      spawn(pcga, cga_loop, 
            [N, GroupSize, Fun, initial_model(GroupSize), Tolerance, Acc, Con, Print])
    end,
    repeat(Groups, 1, [])).
 
cga_loop(N, Size, Fitness, Model, Tolerance, Acc, Con, Print) ->
  [{Str1, P1}, {Str2, P2} | _] = lists:map(
    fun (_) -> Str = random_string(Model), {Str, Fitness(Str)} end,
    [1,2]),
  Acc ! {eval, self(), P1, P2},
  receive
    {final_eval, FF1, FF2} ->
      NewModel = lists:map(fun ({M, U}) -> M + U end,
                           lists:zip(Model, update({Str1, FF1}, {Str2, FF2}, Size, N))),
      case converged(NewModel, Tolerance) of
        true -> Con ! {converged, self(), 1};
        false ->  Con ! {converged, self(), 0}
      end,
      receive
        {final_converged, true} -> 
          case Print of 
            true -> io:fwrite("~p\n", [NewModel]);
            _ -> true
          end,
          Acc ! Con ! stop;
        {final_converged, false} -> 
          cga_loop(N, Size, Fitness, NewModel, Tolerance, Acc, Con, Print)
      end
  end.

% Creates a list of Size repeating Value.
repeat(0, _, Update) -> Update;
repeat(N, Value, Update) -> repeat(N - 1, Value, [Value | Update]).

The code above lets you decide how many model fragments (Groups) to create. Each fragment is assigned to its own process and holds GroupSize variables of the model, while N is the population size. A simple example of how to run the code:

c(pcga).
pcga:cga(50000, 500, 10, fun pcga:one_max/1, 0.01, true).

The model will contain 5,000 variables split across 10 processes, each holding a fragment of 500 variables, with a simulated population size of 50,000. I guess now the only thing left is to measure how this scales.

Yet Another cGA Implementation, Now in Erlang.

Wanna have some Sunday afternoon fun? Just refresh your Erlang skills. Since this is me having fun, what better way to do so than to write yet another implementation of the compact Genetic Algorithm (cGA) originally proposed by Georges Harik?

I am going to skip describing the original algorithm and focus a bit on how to implement it in Erlang instead. You can find some nice books elsewhere and more information on the Erlang site. Erlang is an interesting mix of functional and logic programming languages. If you ever wrote code in Prolog, Erlang is going to look familiar. It will also look familiar if you are coming from Haskell, although, Erlang being a dynamically typed language, you will miss the type system and inference. Nevertheless, give it a chance. Its concurrency model is worth reading about, but I will leave that for future posts.

Anyway, without further preamble, let’s dive into a naïve implementation of cGA in Erlang. Lists are an integral part of Erlang, hence it seems obvious that individuals can be represented by a list of integers. Under this representation, OneMax is trivial to implement by summing all the elements of the list defining an individual. Following this train of thought, the probabilistic model can also be represented by a simple list of floats (each entry representing the probability of a 1 at a given locus).

Given the above description, a cGA implementation just requires: (1) an individual constructor that samples the current probabilistic model, (2) a function that, given two evaluated individuals, computes the model update, and (3) a function to check whether the probabilistic model has converged. Once these basic functions are available, writing a cGA boils down to sampling two individuals, computing the required updates based on their evaluations, and updating the probabilistic model. This process is repeated until the model converges. The Erlang code below shows a possible implementation of such an approach.

% Naive implementation of the compact Genetic Algorithm in Erlang.
-module(cga).
-export([one_max/1, cga/4]).

% OneMax function.
one_max(String) -> lists:sum(String).

% Generates random strings of length N given a Model.
random_string(Model) ->
  lists:map(fun (P) -> case random:uniform() < P of true -> 1; _ -> 0 end end,
            Model).

% Generates a random population of size Size and strings of length N.
initial_model(N) -> repeat(N, 0.5, []).

% Given a pair of evaluated strings, returns the update values.
update({_, Fit}, {_, Fit}, N, _) ->
  repeat(N, 0, []);
update({Str1, Fit1}, {Str2, Fit2}, _, Size) ->
  lists:map(fun ({Gene, Gene}) -> 0;
                ({Gene1, _}) when Fit1 > Fit2 -> ((Gene1 * 2) - 1) / Size;
                ({_, Gene2}) when Fit1 < Fit2 -> ((Gene2 * 2) - 1) / Size
            end,
            lists:zip(Str1, Str2)).

% Check if the model has converged.
converged(Model, Tolerance) ->
  lists:all(fun (P) -> (P < Tolerance) or (P > 1 - Tolerance) end, Model).

% The main cGA loop.
cga(N, Size, Fun, Tolerance) when N > 0, Size > 0, Tolerance > 0, Tolerance < 0.5 ->
  cga_loop(N, Size, Fun, initial_model(N), Tolerance).

cga_loop(N, Size, Fitness, Model, Tolerance) ->
  case converged(Model, Tolerance) of
    true ->
      Model;
    false ->
      [P1, P2 | _] = lists:map(
        fun (_) -> Str = random_string(Model), {Str, Fitness(Str)} end,
        [1,2]),
      cga_loop(N, Size, Fitness,
               lists:map(fun ({M, U}) -> M + U end,
                         lists:zip(Model, update(P1, P2, N, Size))),
               Tolerance)
  end.

% Creates a list of Size repeating Value.
repeat(0, _, Update) -> Update;
repeat(N, Value, Update) -> repeat(N - 1, Value, [Value | Update]).

You can run this code by pasting it into a file named cga.erl. Use the Erlang shell to compile and run cGA as shown below (once you start the Erlang shell via $erl).

1> c(cga).
{ok,cga}
2> cga:cga(3, 30, fun cga:one_max/1, 0.01).
[0.999, 0.989, 0.098]

A couple of interesting considerations. Compiling and loading code in Erlang supports hot code replacement without stopping a running production system. Obviously this property is not critical for the cGA exercise, but it is an interesting one nonetheless. Another is that functions, thanks to Erlang's functional programming ancestry, are first-class citizens you can pass around. That means the current implementation supports passing arbitrary fitness functions without changing anything in the cGA code.

Finally, I mentioned that this is a naïve implementation written to refresh my rusty Erlang syntax. You may want to spend some time profiling it to see how it could be improved. Also, you may want to start thinking about how we could take advantage of Erlang's concurrency model to build a not-so-naïve implementation of cGA.

Parallel and Distributed Computational Intelligence book is out for pre-order

“Parallel and Distributed Computational Intelligence”, edited by Francisco Fernández de Vega & Erick Cantú-Paz and published by Springer, is out for pre-order. The first chapter, “When Huge is Routine: Scaling Genetic Algorithms and Estimation of Distribution Algorithms via Data-Intensive Computing”, was written together with coauthors Abhishek Verma, Roy Campbell, and David E. Goldberg, and describes how data-intensive computing can help push the size of problems that GAs and EDAs can address. You may find the abstract of the book below.

Abstract:

The growing success of biologically inspired algorithms in solving large and complex problems has spawned many interesting areas of research. Over the years, one of the mainstays in bio-inspired research has been the exploitation of parallel and distributed environments to speedup computations and to enrich the algorithms. From the early days of research on bio-inspired algorithms, their inherently parallel nature was recognized and different parallelization approaches have been explored. Parallel algorithms promise reductions in execution time and open the door to solve increasingly larger problems. But parallel platforms also inspire new bio-inspired parallel algorithms that, while similar to their sequential counterparts, explore search spaces differently and offer improvements in solution quality.

The objective in editing this book was to assemble a sample of the best work in parallel and distributed biologically inspired algorithms. The editors invited researchers in different domains to submit their work. They aimed to include diverse topics to appeal to a wide audience. Some of the chapters summarize work that has been ongoing for several years, while others describe more recent exploratory work. Collectively, these works offer a global snapshot of the most recent efforts of bioinspired algorithms’ researchers aiming at profiting from parallel and distributed computer architectures—including GPUs, Clusters, Grids, volunteer computing and p2p networks as well as multi-core processors. This volume will be of value to a wide set of readers, including, but not limited to specialists in Bioinspired Algorithms, Parallel and Distributed Computing, as well as computer science students trying to figure out new paths towards the future of computational intelligence.

Scaling eCGA Model Building via Data-Intensive Computing

I just uploaded the technical report of the paper we put together for CEC 2010 on how we can scale up eCGA using a MapReduce approach. Besides exploring the Hadoop implementation, the paper also presents some very compelling results obtained with MongoDB (a document-based store able to perform parallel MapReduce tasks via sharding). The paper is available as PDF and PS.

Abstract:
This paper shows how the extended compact genetic algorithm can be scaled using data-intensive computing techniques such as MapReduce. Two different frameworks (Hadoop and MongoDB) are used to deploy MapReduce implementations of the compact and extended compact genetic algorithms. Results show that both are good choices to deal with large-scale problems as they can scale with the number of commodity machines, as opposed to previous efforts with other techniques that either required specialized high-performance hardware or shared memory environments.

Related posts:

  1. Scaling Genetic Algorithms using MapReduce
  2. Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study using Meandre

New MEDAL reports available online

We are pleased to announce the following MEDAL technical reports:

MEDAL Report No. 2010005
Loopy Substructural Local Search for the Bayesian Optimization Algorithm
Claudio F. Lima, Martin Pelikan, Fernando G. Lobo, and David E. Goldberg (2010)
[Abstract] [Download PDF]

MEDAL Report No. 2010004
Model Accuracy in the Bayesian Optimization Algorithm
Claudio F. Lima, Fernando G. Lobo, Martin Pelikan, and David E. Goldberg (2010)
[Abstract] [Download PDF]

MEDAL Report No. 2010003
Network crossover performance on NK landscapes and deceptive problems
Mark W Hauschild and Martin Pelikan (2010)
[Abstract] [Download PDF]

MEDAL Report No. 2010002
Spurious Dependencies and EDA Scalability
Elizabeth Radetic and Martin Pelikan (2010)
[Abstract] [Download PDF]

MEDAL Report No. 2010001
NK Landscapes, Problem Difficulty, and Hybrid Evolutionary Algorithms
Martin Pelikan (2010)
[Abstract] [Download PDF]

Not your grandmother’s genetic algorithm!

The video of David E. Goldberg’s talk on genetic algorithms entitled Not your Grandmother’s Genetic Algorithm is available on youtube.com. The talk covers topics from the simple genetic algorithm to advanced estimation of distribution algorithms, scalability theory of genetic algorithms and practical solutions to noisy problems of over one billion variables. An amazing lecture, and a must-see for anyone interested in evolutionary computation and stochastic optimization.

The links to the videos: Part 1, part 2, part 3.


Optimization by Building and Using Probabilistic Models (OBUPM-2010) Workshop

The workshop Optimization by Building and Using Probabilistic Models (OBUPM-2010) will take place at the Genetic and Evolutionary Computation Conference (GECCO-2010) in Portland, OR. OBUPM-2010 is organized by Mark Hauschild and Martin Pelikan.

We look forward to seeing you there and invite submission of papers for the workshop. The deadline for paper submission is March 25, 2010. Please check the website of OBUPM-2010 for more detailed information.

GECCO 2010 Submission Deadline (Extended)

If you are planning to submit a paper for the 2010 Genetic and Evolutionary Computation Conference, the deadline is January 13, 2010 (and now extended to January 27th). You can find more information at the GECCO 2010 calendar site.

Related posts:

  1. GECCO 2009 paper submission deadline extended till January 28
  2. GECCO 2007 deadline extended
  3. GECCO-2006 submissions deadline extended to February 1st

Scaling Genetic Algorithms using MapReduce

Below you may find the abstract of, and the link to, the technical report of the paper entitled “Scaling Genetic Algorithms using MapReduce”, which will be presented at the Ninth International Conference on Intelligent Systems Design and Applications (ISDA) 2009 by Verma, A., Llorà, X., Campbell, R.H., and Goldberg, D.E. next month.

Abstract: Genetic algorithms (GAs) are increasingly being applied to large scale problems. The traditional MPI-based parallel GAs do not scale very well. MapReduce is a powerful abstraction developed by Google for making scalable and fault tolerant applications. In this paper, we mould genetic algorithms into the MapReduce model. We describe the algorithm design and implementation of GAs on Hadoop, the open source implementation of MapReduce. Our experiments demonstrate convergence and scalability up to 10^5-variable problems. Adding more resources would enable us to solve even larger problems without any changes in the algorithms and implementation.

The draft of the paper can be downloaded as IlliGAL TR. No. 2009007. For more information see the IlliGAL technical reports web site.

Related posts:

  1. Scaling eCGA Model Building via Data-Intensive Computing
  2. Data-Intensive Computing for Competent Genetic Algorithms: A Pilot Study using Meandre