Inviso tracing tutorial

From ErlangCentral Wiki



This application was removed in R16A. This documentation is retained for individuals using R15B03-1 and earlier. Consider using dbg instead.


Sean McEvoy


The inviso trace tool is part of the OTP release and offers a very powerful way to analyze and debug distributed Erlang applications.

It is built on the same Erlang trace BIFs as the better-known dbg, but has more options for handling trace information and is designed with distributed systems in mind.

While dbg is probably still the better tool for ad-hoc tracing on a newly found fault, due to its simpler interface, inviso is much more powerful if you want to design an automatic test. System profiling, white-box testing and property testing are all easier with inviso.

Here I will outline how I use its basic functionality.


To use inviso, each Erlang node in your system must have the runtime_tools application started. These runtime components run on each node in the system under test, and they are all controlled from one Erlang node, which may be part of the system itself or a separate dedicated test node.

Personally, I like the idea of running traces from a separate node outside the system. This way the test is like an external observer that can still see every internal detail of the system. Once it is finished, it can be closed down, leaving the system in its original condition as if it had never been touched. The performance impact on the system is small either way, but with this method the configuration impact can be zero.

On the other hand, there may be applications where it is useful to have one node in the system run some self-diagnosis on the system. In this situation the control component can be part of the system.


The best way I can explain how to use inviso is to show a basic example.

For demonstration purposes, start an Erlang node with the short name server (erl -sname server), then compile and start the following module on it:

-module(server).
-export([start/0,stop/0,loop/1,whatis/1]).

start() ->
    Pid = spawn(?MODULE,loop,[[]]),
    register(server,Pid).

stop() -> server ! stop.

loop(Data) ->
    receive
        {put,From,Ting} -> From ! whatis(atom_to_list(Ting)),
                           loop(Data ++ [Ting]);
        {get,From}      -> From ! {ok,Data},
                           loop(Data);
        stop            -> stopped;
        clear           -> loop([])
    end.

whatis(List) when hd(List) =< 96  -> nok;
whatis(List) when hd(List) == 99  -> nok; %% Intentional bug for demonstration,
                                          %% all others are accidental
whatis(List) when hd(List) == 100 -> ok;
whatis(List) when hd(List) == 101 -> ok;
whatis(List) when hd(List) >= 123 -> nok;
whatis(List) when is_integer(hd(List)) -> ok.

Start a second Erlang node with the short name client, compile the following module on it and call its init() function:

-module(client).
-export([init/0,init/1,get/0,put/1]).

init() -> init(server_node()).
init(Node) -> net_adm:ping(Node). %% make sure we can reach the server node

server_node() ->
    {ok,HostName} = inet:gethostname(),
    list_to_atom("server@" ++ HostName).

get() ->
    erlang:send({server,server_node()}, {get,self()}),
    receive {ok,Data} -> {ok,Data}
    after 1000 -> no_reply
    end.

put(Ting) ->
    erlang:send({server,server_node()}, {put,self(),Ting}),
    receive ok -> ok
    after 1000 -> no_reply
    end.

Start a third node with the short name test; we will use this node to test the client/server behaviour on the other two nodes. Make sure it can communicate with the nodes in the system we want to test (client and server) and start inviso:
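A minimal sketch of this step (the host name linux is taken from the shell prompts later in this article; adjust to your own host):

```erlang
%% On the test node: connect to the system nodes and start the
%% inviso control component.
net_adm:ping('server@linux'),
net_adm:ping('client@linux'),
inviso:start().
```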


Adding Nodes

The first step in setting up a trace is to add the nodes that you want to trace on. In our example we have two nodes to add, enter this command in the test node (all inviso commands in this article are entered in the test node):
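Following the calls shown in the debugging session at the end of this article, the command looks like this (mytag is an arbitrary tag identifying our trace session):

```erlang
%% The empty list holds options such as an overload protector.
inviso:add_nodes(['server@linux','client@linux'], mytag, []).
```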


Here we have the option of specifying an overload protector: a function we can write that decides whether a node is overloaded. If it is, tracing is suspended on that node to avoid creating further load. I'll keep this example simple, but if this functionality is useful in your application it is fully documented in the online manual.

Initialise Tracing

Once we've specified which nodes we want to trace we need to specify what we want to do with the traced events. We have several options.

1. Log all trace events to file:
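A sketch of what such a call might look like (the log file names here are my own; check the inviso manual for the full tracerdata options):

```erlang
%% One tracerdata entry per node; each node writes its own trace
%% events to a local log file in its working directory.
inviso:init_tracing([{'server@linux',[{trace,{file,"server.log"}}]},
                     {'client@linux',[{trace,{file,"client.log"}}]}]).
```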


These log files will then be located in the working directory of the node.

2. Display all trace events in the shell of the node where they occur:
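One way to get this effect is a handler fun on each node that simply prints (a sketch, using the same handler-fun tracerdata form as option 4 below):

```erlang
%% The fun's second argument is the handler state, which we pass
%% through unchanged since printing needs no accumulation.
Print = fun(Trace,State) -> io:format("~p~n",[Trace]), State end,
inviso:init_tracing([{'server@linux',[{trace,{Print,ok}}]},
                     {'client@linux',[{trace,{Print,ok}}]}]).
```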


3. Send all trace events back to the test node; for this option we first need to add the test node itself:
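A sketch of this setup (the relayer tracerdata forwards each node's events to the collecting node; the log file name is my own):

```erlang
%% The test node must be added before it can collect.
inviso:add_nodes([node()], mytag, []),
inviso:init_tracing([{'server@linux',{relayer,node()}},
                     {'client@linux',{relayer,node()}},
                     {node(),[{trace,{file,"relayed.log"}}]}]).
```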


4. Send all trace events to custom functions:

    inviso:init_tracing(
        [{'server@linux',[{trace,{fun(Trace,State) ->
                                      io:format("~p~n",[Trace]),
                                      State + 1 end, 0}}]},
         {'client@linux',[{trace,{fun(Trace,State) ->
                                      io:format("~p~n",[Trace]),
                                      State + 1 end, 0}}]}]).

These custom functions are initialised with state information, 0 in this example. Each trace event causes the function to be called with two parameters: the trace event and the current state. The function returns a new state, so information can accumulate over trace events. In this example the events are only printed and counted, but see the Basic Profiler heading below for a more advanced example of what can be achieved.

Set Trace Flags

Next we need to decide what events we want to trace. In order to trace on certain function calls I could make the calls:
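These are the same tpl and tf calls that appear in the debugging session at the end of this article:

```erlang
%% Local trace patterns: server:loop/1 on the server node, every
%% function in client on the client node, each with a match spec
%% that also traces the return value.
MS = [{'_',[],[{return_trace}]}],
inviso:tpl(['server@linux'], server, loop, '_', MS),
inviso:tpl(['client@linux'], client, '_', '_', MS),
%% Generate a timestamped trace message for every matching call:
inviso:tf(all, [call,timestamp]).
```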


These will set trace patterns on each call to server:loop, on each call to any function in the client.erl module and on each of these functions returning.

Setting the call flag will result in a trace message being generated each time execution matches one of these traced functions; setting the timestamp flag adds an erlang:now() timestamp to each message.

Instead, I might want to observe all messages sent by the two nodes.
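A sketch of the calls described below (the exact calls depend on your tracerdata; tpm_localnames in particular assumes a ti-file was included when tracing was initialised):

```erlang
%% Meta-trace erlang:register/2 so registered names can be used in
%% place of pids:
inviso:tpm_localnames(['server@linux','client@linux']),
%% Message flags on the registered server process, and on every
%% process created on the client node:
inviso:tf(['server@linux'], server, [send,'receive',timestamp]),
inviso:tf(['client@linux'], new, [send,'receive',timestamp]).
```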


The first call here links all the registered process aliases to their process identifiers. This makes it possible to reference the "server" process in the next function and set flags for each message sent and received. On the client node there is no registered process so I choose to set flags for each new process. The timestamp flag is also added for each call.

After this step everything should be configured properly and the specified tracing messages should be generated and handled in the specified way.

Basic Profiler

To show the possibilities of this tool I've quickly implemented a basic profiler. To run this add the nodes as described above, including the test node. Then initialise the tracing like this:

    inviso:init_tracing(
        [{'server@linux',{relayer,node()}},
         {'client@linux',{relayer,node()}},
         {node(),[{trace,
            {fun({trace_ts,Pid,call,{M,F,_A},{MgS,S,McS}},CList) ->
                 Call = {{M,F,Pid},(McS+(S*1000000)+(MgS*1000000000000))},
                 CList ++ [Call];
                ({trace_ts,Pid,return_from,{M,F,_A},_Ret,{MgS,S,McS}},CList) ->
                 FTime = (McS+(S*1000000)+(MgS*1000000000000)),
                 case lists:keysearch({M,F,Pid},1,CList) of
                     {value,{_,STime}} ->
                         ETime = FTime - STime,
                         io:format("~n~p:~p~nExecutionTime: ~p~n~n",
                                   [M,F,ETime]),
                         lists:keydelete({M,F,Pid},1,CList);
                     _ -> CList %% Function called before trace started
                 end;
                (Msg,CList) ->
                 io:format("~nOther Message~n~p~n~p~n",[Msg,CList]),
                 CList
             end, []}}]}]).

This tells the client and server nodes to relay all trace messages back to the test node. Once they reach the test node they are handled by a customised function that implements a simple profiler, printing out how long each function takes to execute.

Instead of this you could design your own property checker or even try to mimic the system under test, the possibilities are endless! ;-)

Set trace flags on the function calls as shown above and we are ready to profile our code.

Now, in the shell of the client when I enter the commands:

(client@linux)18> client:put(dis).
(client@linux)19> client:put(dat).
(client@linux)20> client:put(deze).
(client@linux)21> client:put(doze).
(client@linux)22> client:get().

In the test node I see the execution time for each function call; each number is in microseconds:

ExecutionTime: 232

ExecutionTime: 29

ExecutionTime: 188

ExecutionTime: 1834

ExecutionTime: 17

ExecutionTime: 176

ExecutionTime: 1625

ExecutionTime: 189

ExecutionTime: 15

ExecutionTime: 1728

ExecutionTime: 181

ExecutionTime: 1473

ExecutionTime: 10

ExecutionTime: 219

ExecutionTime: 188

ExecutionTime: 1332


If you're finished tracing and want to leave the system running, you'll have to close everything down cleanly and unset all the flags:
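A sketch of the shutdown sequence (consult the inviso man page for the exact clear-down functions in your release):

```erlang
inviso:ctf_all(['server@linux','client@linux']), %% clear all trace flags
inviso:stop_tracing(),  %% stop tracing and close any open log files
inviso:clear(),         %% remove remaining trace patterns and settings
inviso:stop().          %% shut down the inviso control component
```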


If you chose to log everything to file you may want to merge logfiles from different nodes into one file. The merge function will use the timestamps on each log entry and merge them chronologically so that the execution flow can be easily seen across different nodes. This is particularly useful for tracing messages sent between nodes or remote procedure calls to other nodes. In this example the output file will be in the working directory of the test node.
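The merge is done with the inviso log file merger; a sketch, where the file names are whatever the tracerdata specified when tracing was initialised (the exact file-description format is documented in the inviso_lfm man page):

```erlang
%% Merge the per-node logs chronologically; the merged file ends up
%% in the working directory of the test node.
inviso_lfm:merge([{'server@linux',[{trace_log,["server.log"]}]},
                  {'client@linux',[{trace_log,["client.log"]}]}],
                 "merged.log").
```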



While using the client I notice that I get an unexpected return value for one put operation:

(client@linux)24> client:put(apple).
(client@linux)25> client:put(branch).
(client@linux)26> client:put(cat).
(client@linux)27> client:put(elf).
(client@linux)28> client:get().

To find out what's wrong I set up some traces from the test node:

(test@linux)153> inviso:add_nodes(['server@linux','client@linux'],mytag,[]).
(test@linux)154> inviso:add_nodes([node()],mytag,[]).
(test@linux)155> inviso:init_tracing([{'client@linux',{relayer,node()}},
                     {'server@linux',{relayer,node()}},
                     {node(),[{trace,{fun(T,S) -> io:format("~p~n",[T]), S end, 0}}]}]).
(test@linux)156> inviso:tpl(['server@linux'],server,'_','_',[{'_',[],[{return_trace}]}]).
(test@linux)157> inviso:tpl(['client@linux'],client,'_','_',[{'_',[],[{return_trace}]}]).
(test@linux)158> inviso:tf(all,[call,timestamp]).

I repeat the 4 put operations again and this time I can observe the traces in the test node:

(<6433.242.0>) call server:whatis("apple")
(<6433.242.0>) returned from server:whatis/1 -> ok
(<6433.242.0>) call server:loop([dis,dat,deze,doze,apple,branch,cat,elf,apple])
(<6381.39.0>) call client:put(apple)
(<6381.39.0>) call client:server_node()
(<6381.39.0>) returned from client:server_node/0 -> 'server@linux'
(<6381.39.0>) returned from client:put/1 -> ok
(<6433.242.0>) call server:whatis("branch")
(<6433.242.0>) returned from server:whatis/1 -> ok
(<6433.242.0>) call  server:loop([dis,dat,deze,doze,apple,branch,cat,elf,apple,branch])
(<6381.39.0>) call client:put(branch)
(<6381.39.0>) call client:server_node()
(<6381.39.0>) returned from client:server_node/0 -> 'server@linux'
(<6381.39.0>) returned from client:put/1 -> ok
(<6433.242.0>) call server:whatis("cat")
(<6433.242.0>) returned from server:whatis/1 -> nok
(<6433.242.0>) call  server:loop([dis,dat,deze,doze,apple,branch,cat,elf,apple,branch,cat])
(<6381.39.0>) call client:put(cat)
(<6381.39.0>) call client:server_node()
(<6381.39.0>) returned from client:server_node/0 -> 'server@linux'
(<6381.39.0>) returned from client:put/1 -> no_reply
(<6381.39.0>) call client:put(elf)
(<6381.39.0>) call client:server_node()
(<6381.39.0>) returned from client:server_node/0 -> 'server@linux'
(<6381.39.0>) returned from client:put/1 -> ok
(<6433.242.0>) call server:whatis("elf")
(<6433.242.0>) returned from server:whatis/1 -> ok
(<6433.242.0>) call  server:loop([dis,dat,deze,doze,apple,branch,cat,elf,apple,branch,cat,elf])
(<6381.39.0>) call client:get()
(<6381.39.0>) call client:server_node()
(<6381.39.0>) returned from client:server_node/0 -> 'server@linux'
(<6381.39.0>) returned from client:get/0 ->  {ok,[dis,dat,deze,doze,apple,branch,cat,elf,apple,branch,cat,elf]}
(<6433.242.0>) call  server:loop([dis,dat,deze,doze,apple,branch,cat,elf,apple,branch,cat,elf])

Note that these messages are not displayed in perfect order. They appear in the shell in the order they are received by the test node, and since they come from different nodes they are not guaranteed to arrive in the order they occurred. Their timestamps (not shown here) are assigned immediately when the events occur on their original nodes, so if you really need them in order it is probably best to log them to file and merge the logs afterwards.

However, for our present needs this data is sufficient to see the fault. The server:whatis function returns nok when called with the atom cat, while it returns ok during every other put operation. Analyzing further, we see that the put operation only handles atoms, and that the whatis function performs different actions depending on the first character of the atom; it seems there was a copy-paste error in the special case for "c".


As mentioned earlier inviso provides some overload protection functionality that is not covered in this article. It is easy to use and documented in the online user guide, see the link below.

There is also a way to configure inviso to start tracing on a node during startup. Sometimes an error occurs during startup, so you cannot set up tracing in time to observe it as it happens. inviso provides a function to create a text file specifying which traces to set up; this file is read during the next restart, so the next time the error occurs the trace should capture some extra data.

There is also some meta tracing functionality provided that is not covered here. If anyone wants to expand this article with examples of this feature I'll be an interested reader!

I hope you find this article helpful, inviso looks like a great tool! Another thumbs up for the OTP team! Feel free to point out or correct any assumptions or omissions I've made.

See Also

User Guide

Man Page