Fast TCP sockets

Summary

Handling TCP connections can become a bottleneck when the message rate approaches 10k rps: reading from and writing to sockets becomes a problem in itself, while most CPU cores remain idle.

In this article, I suggest optimizations to improve three aspects of TCP handling: accepting connections, receiving messages, and responding to them. The article is addressed to Erlang users as well as everyone interested in Erlang. The provided code pieces are small and well explained, so deep knowledge of Erlang is not required.

By “Handling TCP” I mean three things:

  • 1. Accepting connections
  • 2. Receiving messages
  • 3. Responding to messages

Depending on your task, each or all of these aspects may be critical. I’m going to cover all three.

I’m reviewing two approaches to handling TCP—directly with gen_tcp [1] and with ranch [2], the most popular Erlang library for socket programming. Some of the suggested optimizations are applicable in both cases, some not.

To benchmark the optimizations, I’m using MZBench [3] with a worker that implements connect and request functions along with additional synchronization functions. I have two testing scenarios: “fast_connect” and “fast_receive.” The first tries to establish a lot of connections; the second tries to send a lot of packets over established connections. Each scenario runs on an Amazon c4.2xlarge node.

The scenarios and the MZBench worker code are available on GitHub [4].

Accepting Connections

Accepting connections quickly is important when you have to handle lots of client sessions with few messages in each, for example if your clients have a limited lifetime or don’t support long-lived connections.

Optimizing ranch

The simplest way to create a TCP service is via ranch, which is a popular TCP acceptor pool. Let’s change the default ranch echo service [5] to reply “ok” to any packet and benchmark it. Here is the diff that I applied to the original echo-server code example:

--- a/examples/tcp_echo/src/echo_protocol.erl
+++ b/examples/tcp_echo/src/echo_protocol.erl
@@ -16,8 +16,8 @@ init(Ref, Socket, Transport, _Opts = []) ->
 
 loop(Socket, Transport) ->
        case Transport:recv(Socket, 0, 5000) of
-               {ok, Data} ->
-                       Transport:send(Socket, Data),
+               {ok, _Data} ->
+                       Transport:send(Socket, <<"ok">>),
                        loop(Socket, Transport);
                _ ->
                        ok = Transport:close(Socket)

--- a/examples/tcp_echo/src/tcp_echo_app.erl
+++ b/examples/tcp_echo/src/tcp_echo_app.erl
@@ -11,8 +11,8 @@
 %% API.
 
 start(_Type, _Args) ->
-       {ok, _} = ranch:start_listener(tcp_echo, 1,
-               ranch_tcp, [{port, 5555}], echo_protocol, []),
+       {ok, _} = ranch:start_listener(tcp_echo, 100,
+               ranch_tcp, [{port, 5555}, {max_connections, infinity}], echo_protocol, []),
        tcp_echo_sup:start_link().

We start by running the “fast_connect” benchmark:

[Chart 6: latency percentiles (left) and connection rate (right)]

The chart on the left shows one spike at 214 ms; the other lines below it are percentiles, updated every 5 seconds. The chart on the right shows the request rate, which in this particular case equals the number of open connections, since only one message is sent over each connection.

Increasing the connection rate reveals the following results:

[Chart 7: latency spikes under an increased connection rate]

Each spike with a latency of 1000 ms means that the script ran into a timeout. If we continue to increase the message rate in this configuration, the spikes appear even more often. The first spikes appear at about 5k rps, and they become constant at rates higher than 11k rps.

Replace timeout in Receive Call with timer:sleep()

I found that removing the timeout parameter from the receive call speeds up connection acceptance. To avoid busy-polling the socket, I added a timer:sleep(20) call:

--- a/examples/tcp_echo/src/echo_protocol.erl
+++ b/examples/tcp_echo/src/echo_protocol.erl
@@ -15,10 +15,11 @@ init(Ref, Socket, Transport, _Opts = []) ->
        loop(Socket, Transport).
 
 loop(Socket, Transport) ->
-       case Transport:recv(Socket, 0, 5000) of
-               {ok, Data} ->
-                       Transport:send(Socket, Data),
+       case Transport:recv(Socket, 0, 0) of
+               {ok, _Data} ->
+                       Transport:send(Socket, <<"ok">>),
                        loop(Socket, Transport);
+               {error, timeout} -> timer:sleep(20), loop(Socket, Transport);
                _ ->
                        ok = Transport:close(Socket)
        end.

With this optimization, the ranch app can accept more connections per second; the first 1000 ms timeout appeared at around 11k rps:

[Chart 8: ranch latencies with the timer:sleep() optimization]

Long timeout spikes reappear as we increase the rate further, so the final number for ranch is 24k rps.

  • Conclusion
  • With the suggested optimization, we gained more than a 2x improvement in accepting connections, going from 11k to 24k rps.

Optimizing gen_tcp

Here’s a pure gen_tcp implementation of the same service we built with ranch earlier (simple.erl in the repo):

-export([service/1]).

-define(Options, [
    binary,
    {backlog, 128},
    {active, false},
    {buffer, 65536},
    {keepalive, true},
    {reuseaddr, true}
]).

-define(Timeout, 5000).

main([Port]) ->
    {ok, ListenSocket} = gen_tcp:listen(list_to_integer(Port), ?Options),
    accept(ListenSocket).

accept(ListenSocket) ->
    case gen_tcp:accept(ListenSocket) of
        {ok, Socket} -> erlang:spawn(?MODULE, service, [Socket]), accept(ListenSocket);
        {error, closed} -> ok
    end.

service(Socket) ->
    case gen_tcp:recv(Socket, 0, ?Timeout) of
        {ok, _Binary} -> gen_tcp:send(Socket, <<"ok">>), service(Socket);
        _ -> gen_tcp:close(Socket)
    end.
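
Before benchmarking, we can sanity-check the server from the Erlang shell. A quick session like the following (my addition, assuming the server runs locally on port 5555) should get the “ok” reply back:

%% connect in passive binary mode, send anything, expect <<"ok">> in response
{ok, S} = gen_tcp:connect("localhost", 5555, [binary, {active, false}]),
ok = gen_tcp:send(S, <<"ping">>),
{ok, <<"ok">>} = gen_tcp:recv(S, 0, 1000),
ok = gen_tcp:close(S).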

Running the same benchmark, I got the following results:

[Chart 9: latencies for the unoptimized gen_tcp server]


As you can see, when the rate reaches 18k rps, connection acceptance becomes unreliable. The final number for the unoptimized gen_tcp script is 18k rps.

Replace timeout in Receive Call with timer:sleep()

The first idea is to apply the same optimization we did for ranch:

service(Socket) ->
    case gen_tcp:recv(Socket, 0, 0) of
        {ok, _Binary} -> gen_tcp:send(Socket, <<"ok">>), service(Socket);
        {error, timeout} -> timer:sleep(20), service(Socket);
        _ -> gen_tcp:close(Socket)
    end.

Looks like we can handle 23k rps now:

[Chart 10: gen_tcp latencies with the timer:sleep() optimization]

Add More Accepting Processes

The second idea is to increase the number of accepting processes. This can be done by calling gen_tcp:accept from multiple processes:

%% note: accept/1 must be exported so that spawn/3 can call it
main([Port]) ->
    {ok, ListenSocket} = gen_tcp:listen(list_to_integer(Port), ?Options),
    erlang:spawn(?MODULE, accept, [ListenSocket]),
    erlang:spawn(?MODULE, accept, [ListenSocket]),
    accept(ListenSocket).

Testing it under load gives us 32k rps:

[Chart 11: gen_tcp latencies with three acceptor processes]

As I increase the load further, latencies grow steadily.

  • Conclusion
  • The timeout optimization for gen_tcp increases the acceptance rate by 5k rps, from 18k to 23k.
  • With multiple acceptor processes on a single listen socket, gen_tcp handles 32k rps, 1.8 times more than without optimizations.

Summary

  • 1. It is better not to use the timeout parameter in receive calls; a zero timeout plus timer:sleep/1 works better. This applies both to ranch and to pure gen_tcp. For ranch, it doubles the connection acceptance rate.
  • 2. It is better to accept new connections from multiple processes. This applies only to pure gen_tcp. In my case it gave a 40% improvement in the connection acceptance rate when combined with replacing the timeout with timer:sleep() (see the combined sketch below).
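
For reference, here is a minimal skeleton of the gen_tcp server with both optimizations applied, assembled from the snippets above (a sketch, not a listing from the article’s repo; note that accept/1 and service/1 must be exported for spawn/3):

-export([accept/1, service/1]).

main([Port]) ->
    {ok, ListenSocket} = gen_tcp:listen(list_to_integer(Port), ?Options),
    %% extra acceptor processes sharing the same listen socket
    erlang:spawn(?MODULE, accept, [ListenSocket]),
    erlang:spawn(?MODULE, accept, [ListenSocket]),
    accept(ListenSocket).

accept(ListenSocket) ->
    case gen_tcp:accept(ListenSocket) of
        {ok, Socket} -> erlang:spawn(?MODULE, service, [Socket]), accept(ListenSocket);
        {error, closed} -> ok
    end.

service(Socket) ->
    %% zero recv timeout plus a short sleep instead of a long recv timeout
    case gen_tcp:recv(Socket, 0, 0) of
        {ok, _Binary} -> gen_tcp:send(Socket, <<"ok">>), service(Socket);
        {error, timeout} -> timer:sleep(20), service(Socket);
        _ -> gen_tcp:close(Socket)
    end.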

Receiving Messages

This part is about receiving a lot of messages going through a limited number of established connections. The reconnection rate is low, but I want to read the incoming messages and respond to them as fast as I can. This is common for request-intensive web applications receiving and sending data through websockets.

In this part, I establish 25k connections from multiple nodes and gradually increase the message rate.

Optimizing ranch

Here are the results for the unoptimized ranch-based solution (latency percentiles on the left, rate on the right):

[Chart 12: unoptimized ranch under 25k connections]

With no optimizations, ranch handles 70k rps with a maximum latency of 800 ms.

Increase Linux Socket Buffer Size

It is popular to tune Linux socket buffers to get more rps; kernel-wide limits are usually raised via sysctl (for example, net.core.rmem_max and net.core.wmem_max). Let’s see how increasing the buffer size affects response latencies and maximum rps.
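
For completeness, the per-socket buffers can also be set from the Erlang side via the recbuf and sndbuf options. A sketch (the 256 KB values are my assumption; the article does not state the sizes it used):

Options = [binary,
           {active, false},
           {recbuf, 262144},   % kernel receive buffer for this socket, in bytes
           {sndbuf, 262144}],  % kernel send buffer for this socket, in bytes
{ok, ListenSocket} = gen_tcp:listen(5555, Options).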

Ranch on tuned Linux:

[Chart 13: ranch latencies with increased Linux socket buffers]

  • Conclusion
  • Increasing Linux socket buffer size has barely any effect on the ranch-based script.

Optimizing gen_tcp

Let’s check the unoptimized gen_tcp solution from the previous part:

[Chart 14: unoptimized gen_tcp under 25k connections]

70k rps, same as ranch.

Reduce the Number of Reading Processes

In the case above, we have 25k processes reading from sockets, one process per connection. Now, I’ll attempt to reduce this number and see how this change affects the receiving rate.

To reduce the number of reading processes, I attach multiple sockets to the same process by spawning 100 worker readers and sending them sockets on accept:

-define(Readers, 100).      % number of reader processes
-define(SmallTimeout, 50).  % ms to wait for new sockets between polling rounds

main([Port]) ->
    {ok, ListenSocket} = gen_tcp:listen(list_to_integer(Port), ?Options),
    Readers = [erlang:spawn(?MODULE, reader, []) || _X <- lists:seq(1, ?Readers)],
    accept(ListenSocket, Readers, []).

%% hand out accepted sockets to the readers round-robin
accept(ListenSocket, [], Reversed) -> accept(ListenSocket, lists:reverse(Reversed), []);
accept(ListenSocket, [Reader | Rest], Reversed) ->
    case gen_tcp:accept(ListenSocket) of
        {ok, Socket} -> Reader ! Socket, accept(ListenSocket, Rest, [Reader | Reversed]);
        {error, closed} -> ok
    end.

reader() -> reader([]).

%% poll one socket; return false to drop it from the reader's list
read_socket(S) ->
    case gen_tcp:recv(S, 0, 0) of
        {ok, _Binary} -> gen_tcp:send(S, <<"ok">>), true;
        {error, timeout} -> true;
        _ -> gen_tcp:close(S), false
    end.

reader(Sockets) ->
    Sockets2 = lists:filter(fun read_socket/1, Sockets),
    receive
        S -> reader([S | Sockets2])
    after ?SmallTimeout -> reader(Sockets2)
    end.

With this optimization, we get a huge performance boost:

[Chart 15: gen_tcp latencies with 100 reader processes]

Latencies are much better, and the overall number of requests is higher, at 100k rps. In fact, this server implementation can handle more than 120k rps, but latencies start to grow at such rates.

  • Conclusion
  • Attaching multiple sockets to a single reading process improves the receiving rate by at least 50% for a pure gen_tcp server.

Increase Linux Socket Buffer Size

Let’s apply the socket buffer optimization to a vanilla gen_tcp script:

[Chart 16: vanilla gen_tcp with increased Linux socket buffers]

As with ranch, we don’t see any improvement here. On the contrary, the latencies become worse and the system’s predictability suffers.

Applying this optimization to an already optimized gen_tcp server also makes for worse latencies, producing numerous spikes:

[Chart 17: optimized gen_tcp with increased Linux socket buffers]

  • Conclusion
  • Pure gen_tcp solutions perform worse with increased Linux socket buffers. On the other hand, reducing the number of reading processes gives us a 50% improvement in the message receiving rate.

Summary

  • 1. Ranch-based and pure gen_tcp solutions perform the same, at about 70k rps.
  • 2. Increasing the Linux socket buffer size does not make things better for ranch, and in the case of gen_tcp it makes things worse.
  • 3. A gen_tcp solution with multiple sockets per process is at least 1.5 times faster than the unoptimized one and has much better response times. Unfortunately, this optimization is not applicable to ranch due to its architecture.

Responding to Messages

Technically, the two previous benchmark scenarios already include sending responses, but we haven’t optimized this step specifically. Let’s do it by applying the same optimizations to the send path. I use the “fast_receive” scenario to benchmark this case.

Optimizing Timeouts and Processes

The same optimization ideas mentioned earlier can be applied to responding: remove the timeout and reply from multiple processes. The send function has no timeout parameter; instead, you add {send_timeout, 0} to the socket options when opening it.

Unfortunately, this optimization doesn’t change anything in my benchmarks. To check whether replying from a small number of processes helps, I implemented the following script:

-export([responder/0, service/2]).

-define(Options, [
    binary,
    {backlog, 128},
    {active, false},
    {buffer, 65536},
    {keepalive, true},
    {send_timeout, 0},
    {reuseaddr, true}
]).

-define(SmallTimeout, 50).
-define(Timeout, 5000).
-define(Responders, 200).

main([Port]) ->
    {ok, ListenSocket} = gen_tcp:listen(list_to_integer(Port), ?Options),
    Responders = [erlang:spawn(?MODULE, responder, []) || _X <- lists:seq(1, ?Responders)],
    accept(ListenSocket, Responders, []).

accept(ListenSocket, [], Reversed) -> accept(ListenSocket, lists:reverse(Reversed), []);
accept(ListenSocket, [Responder | Rest], Reversed) ->
    case gen_tcp:accept(ListenSocket) of
        {ok, Socket} -> erlang:spawn(?MODULE, service, [Socket, Responder]), accept(ListenSocket, Rest, [Responder | Reversed]);
        {error, closed} -> ok
    end.

responder() ->
    receive
        S -> gen_tcp:send(S, <<"ok">>), responder()
    after ?SmallTimeout -> responder()
    end.

service(Socket, Responder) ->
    case gen_tcp:recv(Socket, 0, ?Timeout) of
        {ok, _Binary} -> Responder ! Socket, service(Socket, Responder);
        _ -> gen_tcp:close(Socket)
    end.

In this script, responders are separated from readers; I have 25000 readers and 200 responders.

Again, this attempt to optimize responding gives no speedup compared to the unoptimized gen_tcp solution from the previous part:

[Chart 18: latencies with separate responder processes]

Tuning Erlang

When a single process serves multiple sockets, one slow socket can block all the others. To avoid this, I set the {send_timeout, 0} option when opening a socket for responding. If a send fails, I remember the data and retry later.
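
Here is a minimal sketch of that remember-and-retry loop (my simplification, not the article’s actual code):

%% With {send_timeout, 0}, a send on a busy socket fails fast with
%% {error, timeout}; keep the data in a pending list and retry it later.
%% Note: stock gen_tcp:send/2 cannot tell how much of Data was already
%% written, so a blind retry may duplicate bytes; that is exactly the
%% limitation discussed below.
try_send(Socket, Data, Pending) ->
    case gen_tcp:send(Socket, Data) of
        ok -> Pending;
        {error, timeout} -> [{Socket, Data} | Pending];
        {error, _} -> gen_tcp:close(Socket), Pending
    end.

retry_pending(Pending) ->
    lists:foldl(fun({S, D}, Acc) -> try_send(S, D, Acc) end, [], Pending).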

Unfortunately, the send function does not return the number of bytes sent: it returns either the atom “ok” or a POSIX or Erlang error. This makes it impossible to know how much data needs to be resent when a send fails. Knowing how many bytes were successfully sent would let us resend only the missing bytes, using the network efficiently. This is particularly important when dealing with slow connections.

Here’s how you can fix this issue:

Download the Erlang sources from the official website:
$ wget http://erlang.org/download/otp_src_18.2.1.tar.gz
$ tar -xf otp_src_18.2.1.tar.gz
$ cd otp_src_18.2.1

Update the inet driver in erts/emulator/drivers/common/inet_drv.c: add the ability to reply with an integer value by adding this function:

static int inet_reply_ok_int(inet_descriptor* desc, int Val)
{
    ErlDrvTermData spec[2*LOAD_ATOM_CNT + 2*LOAD_PORT_CNT + 2*LOAD_TUPLE_CNT];
    ErlDrvTermData caller = desc->caller;
    int i = 0;

    i = LOAD_ATOM(spec, i, am_inet_reply);
    i = LOAD_PORT(spec, i, desc->dport);
    i = LOAD_ATOM(spec, i, am_ok);
    i = LOAD_INT(spec, i, Val);
    i = LOAD_TUPLE(spec, i, 2);
    i = LOAD_TUPLE(spec, i, 3);
    ASSERT(i == sizeof(spec)/sizeof(*spec));

    desc->caller = 0;
    return erl_drv_send_term(desc->dport, caller, spec, i);
}

Remove the sending of the “ok” atom from the tcp_inet_commandv function (the changed line is marked with >>):

       else
            inet_reply_error(INETP(desc), ENOTCONN);
    }
    else if (desc->tcp_add_flags & TCP_ADDF_PENDING_SHUTDOWN)
        tcp_shutdown_error(desc, EPIPE);
>>    else tcp_sendv(desc, ev);
    DEBUGF(("tcp_inet_commandv(%ld) }\r\n", (long)desc->inet.port));
}

Make tcp_sendv reply with the integer count instead of returning 0 (changed lines are marked with >>):

    default:
         if (len == 0)
>>             return inet_reply_ok_int(desc, 0);
         h_len = 0;
         break;
     }
-----------------------------------
       else if (n == ev->size) {
            ASSERT(NO_SUBSCRIBERS(&INETP(desc)->empty_out_q_subs));
>>            return inet_reply_ok_int(desc, n);
        }
        else {
            DEBUGF(("tcp_sendv(%ld): s=%d, only sent "
                    LLU"/%d of "LLU"/%d bytes/items\r\n",
                    (long)desc->inet.port, desc->inet.s,
                    (llu_t)n, vsize, (llu_t)ev->size, ev->vsize));
        }

        DEBUGF(("tcp_sendv(%ld): s=%d, Send failed, queuing\r\n",
                (long)desc->inet.port, desc->inet.s));
        driver_enqv(ix, ev, n);
        if (!INETP(desc)->is_ignored)
            sock_select(INETP(desc),(FD_WRITE|FD_CLOSE), 1);
    }
>>    return inet_reply_ok_int(desc, n);

Run ./configure && make && make install.

And that’s it: Erlang’s gen_tcp:send function now returns {ok, Number} on success. For example, the following piece of code outputs “9”:

    SomeHostInNet = "localhost", % or wherever the server runs
    {ok, Sock} = gen_tcp:connect(SomeHostInNet, 5555,
                                 [binary, {packet, 0}]),
    {ok, N} = gen_tcp:send(Sock, "Some Data"),
    io:format("~p", [N])

  • Conclusion
  • 1. If you serve multiple sockets within a single process, use {send_timeout, 0} when creating the socket; otherwise, “One drop of poison infects the whole tun of wine.”
  • 2. If your protocol can process partial messages, it is better to patch OTP so that you know the actual number of bytes sent and can resend only the rest.

Cheatsheet

  • 1. If you need fast TCP accepts, use multiple acceptor processes per listen socket.
  • 2. If you need fast socket reads, use multiple sockets per process; don’t use ranch.
  • 3. Don’t increase your socket buffer size: system stability will suffer and you won’t gain much.
  • 4. Use a zero send timeout if you have multiple sockets per process.
  • 5. Patch OTP to know the number of bytes sent.

Useful Links