Monday, July 20, 2015

Evolution in a Computer, Pt. 2

After a long waiting, it's here! To seamlessly continue from the previous part, I prepared a little demo of a genetic algorithm. What we have here is a so-called Facility location problem, or a variation of it.

There is a bunch of cities (50 in our case) and we want to supply the cities with goods so we must open some factories. We must decide how many factories we are going to open and in which cities. We won't explicitly assign which cities are to be supplied by which factories but a city will be assigned to the nearest factory (imagine the factories have unlmitted capacity). The catch: we want to minimize the total cost:

  • every opened factory costs us some fixed amount of money
  • delivering goods from a factory to a city costs us an amount of money equal to the (euclidean) distance from the factory to the city
Clearly, we cannot try every combination of cities with a factory as it's \(2^{50} \approx 1.13 \cdot 10^{15}\) combinations. That would took almost 36 years if checking one solution took one microsecond. Yes, this problem is NP-hard. Instead, we are going to use a... Genetic Algorithm!

So how do we represent a solution to our problem? Well, we'll use a binary string, as in proper GA, with the number of bits equal to the number of cities. Then \(1\) in position \(n\) means that the \(n\)-th city has a factory and vice versa, e.g. a string like 001000110 would mean that the third, seventh and eighth city would have a factory while the first, second, fourth, fifth, sixth and ninth wouldn't. Now, give it a try!

Clicking Start starts the genetic algorithm from beginning with the current map. Clicking New map generates a new (random) map. Clicking Stop during the optimization stops it. You can edit the cost of opening a facility, size of population (must be positive even number), probability of crossover and mutation (must be between 0 and 1) and size of tournament (must be between 2 and size of population). The evolution runs indefinitely (i.e. until you stop it). The world is 100 by 50 units. This thing is written in Dart and you can find the source here.

Real-valued Evolution

So far we have spoken about a genetic algorithm that uses binary strings. This means that our solutions are inherently discrete. But what if we wanted to really find the minimum of the function from the part one?

The „standard“ way

The first way of dealing with real number is to just express them using a binary standard, e.g. the IEEE 754 single-precision floating-point format. But this has problems. Imagine numbers 1.763 and -2.611. These would look like this (e.g. using this tool):

 1.763 = 00111111111000011010100111111100
-2.611 = 11000000001001110001101010100000
And now imagine what would happen when we crossed them over using single-point crossover after the 5th bit:
00111111111000011010100111111100
11000000001001110001101010100000
turns into
00111000001001110001101010100000 = 3.9840699e-05
11000111111000011010100111111100 = -115539.96875
One could say "Well, that escalated quickly!" and indeed, these problems arise all the time. If we explore all the possible crossovers (there are 32 of them resulting in 64 numbers) they include gems like -3.252161001305617e+19, 3.591135600494495e-39 and NaN. Bit-flip mutation is going to have similar problems.

The good way

Much better way than sticking to the binary representation is just use the real numbers. Then, if we have a one-dimensional problem (like in searching for the minimum of that function), the genotype is just one real number. This brings a lot of changes in the crossover and mutation operators.

As we have seen, just crossing over the binary representation is bad. Therefore we need to find a better way of combining two individuals. We would like the offspring(s) to be „close“ to the parents, but be different from them. Since we are in the real numbers, we can use the power of probability and random sampling! One of the simplest real-valued crossover is a so-called arithmetic crossover, which is basically a weighted mean of two (or more!) parents with randomly chosen weights. Or more formally:

$$ \mathbf{x}^{Offs} = \sum_{i=1}^{N_{Pa}} \alpha_i \mathbf{x}_i $$ where \(\sum_{i=1}^{N_{Pa}} \alpha_i = 1, \alpha_i \in (0, 1)\); \(\alpha_i\) are picked randomly for each offspring.
And there are many other crossover operators, e.g. BLX-α, SBX, PCX...

The mutation is easier. One of the most used way is to just add a randomly generated number (or vector if our genotype is more numbers) from a distribution of our choice. The most used distributions are the uniform distribution (where we need to specify the bounds) and the normal distribution (where we need to specify the variance).

Next...

The final part of this introductory mini-series, which will finally get us to what I'm actually doing, is going to be big. It is going to be about Genetic Programming. So stay tuned!