Erlang Central

Generate Unique Values

Revision as of 13:40, 11 December 2009 by TribbleFaith467



Thomas Arts with additional ideas from John Hughes

Generating Unique Values

How to write a generator that produces unique values?

Normally, if you need unique values, there is some notion of 'state': only when there is state does it matter that a value differs from a previous one. For example, you send a sequence of messages to a server and each message needs a unique tag, different from all previous tags. The server 'remembers' the tags it has seen, so if you want to perform positive testing, you must avoid sending the same tag twice.

Lists of unique values

It may be worth simply creating a deterministic list and taking consecutive values from it.

unique() ->

This is entirely deterministic and involves no random testing at all, but it may well suffice for your purpose.

More advanced: if you do not use integers as values, you can generate a list and remove the duplicates from it. Since all terms in Erlang can be compared, there is a very simple way to do this: use lists:usort/1, which sorts the list and keeps a single copy of each value.

unique(Generator) ->

Note that efficiency is no issue in test value generation!
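As a minimal illustration of the deduplication step described above, the following plain-Erlang sketch applies lists:usort/1 to an ordinary list. The module name `usort_sketch` and the function `unique_values/1` are hypothetical names chosen for this example; in a QuickCheck generator the same call would be applied to the generated list.

```erlang
-module(usort_sketch).
-export([unique_values/1]).

%% Sort the list in Erlang term order and drop duplicates,
%% so every value occurs exactly once in the result.
unique_values(Values) when is_list(Values) ->
    lists:usort(Values).
```

For instance, `usort_sketch:unique_values([b, a, b, 3, 1, 3])` returns `[1,3,a,b]`, because numbers sort before atoms in Erlang's term order. Note that the result is always sorted, which is why shuffling becomes interesting.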

In some cases, a strictly increasing list of values may be insufficient for testing. One may want to shuffle the values around to see whether the system under test handles that well.

shuffle([]) ->
shuffle(L) ->

One possibly nice thing about this is the way that it shrinks:

prop_shuffle() ->

erlang> eqc:quickcheck(example:prop_shuffle()).
Failed! After 1 tests.
Shrinking......(6 times)

This may be of value in your testing if you are sure that your software should work for sorted lists in particular.
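The idea of shuffling can also be sketched outside QuickCheck. The version below is plain Erlang using the standard `rand` module: it repeatedly picks a random element, removes it from the list, and continues with the rest. It is an illustration only, not the QuickCheck generator from this page, and it does not shrink; the module name `shuffle_sketch` is an assumption made for the example.

```erlang
-module(shuffle_sketch).
-export([shuffle/1]).

%% Return a random permutation of the list L.
shuffle([]) ->
    [];
shuffle(L) ->
    %% Pick a uniformly random position and take the element found there.
    X = lists:nth(rand:uniform(length(L)), L),
    %% lists:delete/2 removes one occurrence of X, so the remaining
    %% elements (including any duplicates of X) are shuffled recursively.
    [X | shuffle(lists:delete(X, L))].
```

A quick sanity check is that sorting the result gives back the sorted input, for example `lists:sort(shuffle_sketch:shuffle([3,1,2])) =:= [1,2,3]`.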

Vectors of unique values

What if one wants to create a fixed number of unique values, say a vector of length N? Once again, lists:seq(1,N) may do if you are happy with integers. You can even write a function that maps each integer to a distinct value of the type you actually need.

Another approach that works surprisingly well is to generate vectors until you get one that contains only unique values.

uvector(N,G) ->

This may seem rather inefficient, since many vectors may contain a duplicate, in which case you throw the vector away and build a new one. It only seems so: this performs really well as long as the number of distinct values that G can generate is sufficiently larger than the length N of the vector.

erlang> eqc_gen:sample(example:uvector(20,eqc_gen:int())).
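To get a feeling for when the retry approach is cheap, one can compute the chance that a single random vector is already free of duplicates. The sketch below assumes that each element is drawn independently and uniformly from M possible values (as with a generator like choose(1,M)); under that assumption the probability is M * (M-1) * ... * (M-N+1) / M^N. The module and function names are hypothetical.

```erlang
-module(distinct_prob).
-export([p_distinct/2]).

%% Probability that N independent, uniform draws from M possible
%% values are all distinct. Assumes N >= 0 and M >= 1 are integers.
p_distinct(N, M) when N > M ->
    %% More slots than values: a duplicate is unavoidable.
    0.0;
p_distinct(N, M) ->
    %% Numerator counts the duplicate-free vectors: M * (M-1) * ... * (M-N+1).
    Numerator = lists:foldl(fun(I, Acc) -> Acc * (M - I) end,
                            1, lists:seq(0, N - 1)),
    %% Denominator counts all vectors of length N: M^N.
    Numerator / math:pow(M, N).
```

Under this assumption, `distinct_prob:p_distinct(4, 10)` is about 0.5, so roughly every second attempt succeeds, whereas `distinct_prob:p_distinct(8, 10)` is only about 0.018, so most attempts are thrown away.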

It might feel counter-intuitive to generate random lists until one finally finds a list in which no duplicates occur. The temptation is strong to be smarter than QuickCheck and write an "efficient" implementation. One way is to generate the elements one at a time, retrying each new element until it differs from all elements generated so far, and to continue until enough elements have been generated.

uvector(0,Gen) ->
uvector(N,Gen) ->
        ?LET(Value,?SUCHTHAT(V,Gen, not lists:member(V,Values)),

A simple use of this generator is to create unique integers, or to watch it fail when asked for a vector of four unique booleans, of which only two exist.

9> eqc_gen:sample(example:uvector(16,eqc_gen:int())).
10> eqc_gen:sample(example:uvector(4,eqc_gen:bool())).
** exception exit: "?SUCHTHAT failed to find a value within 100 attempts."
     in function  eqc_gen:sample/1
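The element-by-element strategy, including the way it gives up, can be sketched in plain Erlang without QuickCheck. Here the "generator" is simply a zero-argument fun that returns a random value, and the limit of 100 attempts per element mirrors the ?SUCHTHAT message shown above. The module name `uvector_sketch`, the helper `fresh/3`, and the return values `{ok, Vector}` and `{error, gave_up}` are assumptions made for this example.

```erlang
-module(uvector_sketch).
-export([uvector/2]).

-define(MAX_ATTEMPTS, 100).

%% Build a list of N pairwise distinct values by calling GenFun().
%% Returns {ok, Values} or {error, gave_up}.
uvector(0, _GenFun) ->
    {ok, []};
uvector(N, GenFun) when N > 0 ->
    case uvector(N - 1, GenFun) of
        {ok, Values} ->
            %% Draw one more value that is not yet in Values.
            case fresh(GenFun, Values, ?MAX_ATTEMPTS) of
                {ok, V} -> {ok, [V | Values]};
                error   -> {error, gave_up}
            end;
        Error ->
            Error
    end.

%% Retry GenFun() until it yields a value not in Seen,
%% or until the attempts are used up.
fresh(_GenFun, _Seen, 0) ->
    error;
fresh(GenFun, Seen, Attempts) ->
    V = GenFun(),
    case lists:member(V, Seen) of
        false -> {ok, V};
        true  -> fresh(GenFun, Seen, Attempts - 1)
    end.
```

With `fun() -> rand:uniform(10) end` a request for 8 values succeeds in practice, while `uvector_sketch:uvector(4, fun() -> rand:uniform(2) =:= 1 end)` returns `{error, gave_up}`, because no third distinct boolean exists.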

Is that so much better? We write a property that can be used to see the distribution for different vector lengths.

prop_uvector_dist(N,Nth) ->

For example, if we make a list of 4 elements chosen from 1 to 10, we expect an even distribution, i.e., 10% for each number at any position in that list.

71> eqc:quickcheck(unique:prop_uvector_dist(4,1)).
OK, passed 100 tests
17% 8
16% 1
12% 2
10% 5
10% 3
9% 7
8% 10
7% 6
6% 4
5% 9

This property generates vectors of 4 elements chosen from 1 to 10 and collects how often each value occurs at the first position. We expect 10% for each, but a sample of 100 tests is too small to show that. We therefore run 100,000 tests for lists of length 4 and check each position. For both versions of the generator (the first, naive one and the second, more optimized one), we obtain 10% for each number. We conclude that both generators have a good distribution. Now we look at the time difference.

We time generation by

erlang> timer:tc(eqc_gen,sample,[example:uvector(4,eqc_gen:choose(1,10))]).

That is, with the first, naive generator, we need about 1.8 milliseconds to generate 11 values. When we repeat this a few times, we see very similar values. If we do the same for the optimized generator, we see similar values as well, around 1.8 milliseconds.

It becomes more interesting as duplicates become more likely. A sample of the first, naive generator for vectors of length 8 with values 1 to 10 often fails, i.e., gives up, about 9 out of 10 times. When it succeeds, it takes from 25 up to 37 milliseconds. The second, optimized generator works much better: it always succeeds, in around 4 milliseconds.
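Note that timer:tc reports the elapsed time in microseconds, together with the result of the call. A small helper like the following, with the hypothetical names `timing_sketch` and `time_ms/1`, converts the measurement to milliseconds so that it can be compared directly with the figures quoted here.

```erlang
-module(timing_sketch).
-export([time_ms/1]).

%% Run the zero-argument fun Fun and return {Milliseconds, Result}.
time_ms(Fun) when is_function(Fun, 0) ->
    %% timer:tc/1 returns {ElapsedMicroseconds, ResultOfFun}.
    {Micros, Result} = timer:tc(Fun),
    {Micros / 1000, Result}.
```

It could be used as `timing_sketch:time_ms(fun() -> lists:sort(lists:seq(1000, 1, -1)) end)`; single measurements vary, so it is worth repeating the call a few times, as done in the text.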

Thus, being a bit smart is indeed better when your vector size gets close to the number of possible values you can produce. Otherwise, it may be better to use the naive approach, which keeps the generator simple and the specification very clear.