0.5 mln packets per second with Erlang 
Nov 22, 2014 
Maxim Kharchenko 
CTO/Cloudozer LLP
The road map 
• Erlang on Xen intro 
• LINCX project overview 
• Speed-related notes 
– Arguments are registers 
– ETS tables are (mostly) ok 
– Do not overuse records 
– GC is key to speed 
– gen_server vs. barebone process 
– NIFs: more pain than gain 
– Fast counters 
– Static compiler? 
• Q&A
Erlang on Xen a.k.a. LING 
• A new Erlang platform that runs without an OS 
• Conceived in 2009 
• Highly compatible with Erlang/OTP 
• Built from scratch, not a “port” 
• Optimized for low startup latency 
• Open sourced in 2014 (github.com/cloudozer/ling) 
• Local and remote builds 
Go to erlangonxen.org
Zerg demo: zerg.erlangonxen.org
The road map 
• Erlang on Xen intro 
• LINCX project overview 
• Speed-related notes 
– Arguments are registers 
– ETS tables are (mostly) ok 
– Do not overuse records 
– GC is key to speed 
– gen_server vs. barebone process 
– NIFs: more pain than gain 
– Fast counters 
– Static compiler? 
• Q&A
LINCX: project overview 
• Started in December 2013 
• Initial scope = porting LINC-Switch to LING 
• High degree of compatibility demonstrated for LING 
• Extended scope = fix LINC-Switch fast path 
• Beta version of LINCX open sourced on March 3, 2014 
• LINCX runs 100x faster than the old code 
LINCX repository: 
github.com/FlowForwarding/lincx
Raw network interfaces in Erlang 
• LING adds raw network interfaces: 
Port = net_vif:open("eth1", []), 
port_command(Port, <<1,2,3>>), 
receive 
    {Port,{data,Frame}} -> 
... 
• A raw interface receives whole Ethernet frames 
• LINCX uses standard gen_tcp for the control connection and net_vif for the data ports 
• Raw interfaces support a mailbox_limit option; packets get dropped if the mailbox of the receiving process overflows: 
Port = net_vif:open("eth1", [{mailbox_limit,16384}]), 
...
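• A minimal sketch of a forwarding loop built on the net_vif API above; the second interface name ("eth2") and the module layout are illustrative, not part of LINCX: 
-module(vif_forward).
-export([start/0]).

start() ->
    In  = net_vif:open("eth1", [{mailbox_limit,16384}]),
    Out = net_vif:open("eth2", [{mailbox_limit,16384}]),
    loop(In, Out).

%% Receive whole Ethernet frames from one raw interface and push them
%% out of the other, unchanged
loop(In, Out) ->
    receive
        {In,{data,Frame}} ->
            port_command(Out, Frame),
            loop(In, Out)
    end.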
Testbed configuration 
• Test traffic goes between vm1 and vm2 
• LINCX runs as a separate Xen domain 
• Virtual interfaces are bridged in Dom0
IXIA confirms 460kpps peak rate 
• 1GbE hardware NICs, 128-byte packets 
• IXIA packet generator/analyzer
Processing delay and low-level stats 
• LING can measure a processing delay for a packet: 
1> ling:experimental(processing_delay, []). 
Processing delay statistics: 
Packets: 2000 Delay: 1.342us ± 0.143 (95%) 
• LING can collect low-level stats for a network interface: 
1> ling:experimental(llstat, 1). %% stop/display 
Duration: 4868.6ms 
RX: interrupts: 69170 (0 kicks 0.0%) (freq 14207.4/s period 70.4us) 
RX: reqs per int: 0/0.0/0 
RX: tx buf freed per int: 0/8.5/234 
TX: outputs: 1479707 (112263 kicks 7.6%) (freq 303928.8/s period 3.3us) 
TX: tx buf freed per int: 0/0.6/113 
TX: rates: 303.9kpps 3622.66Mbps avg pkt size 1489.9B 
TX: drops: 12392 (freq 2545.3/s period 392.9us) 
TX: drop rates: 2.5kpps 30.26Mbps avg pkt size 1486.0B
The road map 
• Erlang on Xen intro 
• LINCX project overview 
• Speed-related notes 
– Arguments are registers 
– ETS tables are (mostly) ok 
– Do not overuse records 
– GC is key to speed 
– gen_server vs. barebone process 
– NIFs: more pain than gain 
– Fast counters 
– Static compiler? 
• Q&A
Arguments are registers 
animal(batman = Cat, Dog, Horse, Pig, Cow, State) -> 
    feed(Cat, Dog, Horse, Pig, Cow, State); 
animal(Cat, deli = Dog, Horse, Pig, Cow, State) -> 
    pet(Cat, Dog, Horse, Pig, Cow, State); 
... 
• Many arguments do not make a function any slower 
• But do not reshuffle arguments: 
%% SLOW 
animal(batman = Cat, Dog, Horse, Pig, Cow, State) -> 
    feed(Goat, Cat, Dog, Horse, Pig, Cow, State); 
...
ETS tables are (mostly) ok 
• A small ETS table lookup costs about as much as 10 function activations 
• Do not use ets:tab2list() inside tight loops 
• Treat ETS as a database, not a pool of global variables 
• 1-2 ETS lookups on the fast path are ok 
• Beware that ets:lookup(), etc. create a copy of the data on the caller's heap, similar to message passing
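• A small sketch of the difference; the table contents and key/action names are illustrative: 
%% Fine on the fast path: one keyed lookup per packet
%% (ets:lookup/2 copies the matching tuple to the caller's heap)
lookup_action(Flows, Key) ->
    case ets:lookup(Flows, Key) of
        [{_Key,Action}] -> Action;
        [] -> miss
    end.

%% Avoid in tight loops: copies the entire table to the caller's heap
dump_flows(Flows) ->
    ets:tab2list(Flows).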
Do not overuse records 
• setelement() creates a copy of the tuple 
• State#state{foo=Foo1,bar=Bar1,baz=Baz1} creates up to 3 copies of the tuple (one per updated field, unless the compiler optimizes the update) 
• Use tuples explicitly in performance-critical sections to control 
the heap footprint of the code: 
%% from 9p.erl 
mixer({rauth,_,_}, {tauth,_,AFid,_,_}, _) -> {write_auth,AFid}; 
mixer({rauth,_,_}, {tauth,_,AFid,_,_,_}, _) -> {write_auth,AFid}; 
mixer({rwrite,_,_}, _, initial) -> start_attaching; 
mixer({rerror,_,_}, _, initial) -> auth_failed; 
mixer({rlerror,_,_}, _, initial) -> auth_failed; 
mixer({rattach,_,Qid}, {tattach,_,Fid,_,_,AName,_}, initial) -> 
    {attach_more,Fid,AName,qid_type(Qid)}; 
mixer({rclunk,_}, {tclunk,_,Fid}, initial) -> {forget,Fid};
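• A side-by-side sketch; the record fields are illustrative, and whether the compiler collapses a multi-field record update into a single copy depends on the compiler version: 
-record(state, {foo, bar, baz}).

%% Record update: builds at least one fresh copy of the underlying tuple
bump_record(#state{} = S, Foo1, Bar1, Baz1) ->
    S#state{foo = Foo1, bar = Bar1, baz = Baz1}.

%% Explicit tuple: one tuple of exactly the size needed, built once
bump_tuple({state,_Foo,_Bar,_Baz}, Foo1, Bar1, Baz1) ->
    {state,Foo1,Bar1,Baz1}.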
Garbage collection is key to speed 
• The heap is a list of chunks 
• The 'new heap' is close to its head, the 'old heap' to its tail 
• A GC run takes 10μs on average 
• GC may run 1000s of times per second 
[Diagram: proc_t and its heap chunks; HTOP]
How to tackle GC-related issues 
• (Priority 1) Call erlang:garbage_collect() at strategic points 
• (Priority 2) For the fastest code avoid GC completely – restart the fast 
process regularly: 
spawn(F, [{suppress_gc,true}]), %% LING-only 
• (Priority 3) Use fullsweep_after option
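• A sketch of options (1) and (3); the batch size, the fullsweep_after value, and handle_packet/1 are placeholders: 
%% (1) Collect at a strategic point, e.g. after every batch of packets
fast_loop(State, N) when N >= 10000 ->
    erlang:garbage_collect(),
    fast_loop(State, 0);
fast_loop(State, N) ->
    fast_loop(handle_packet(State), N + 1).

%% (3) Plain Erlang/OTP: tune fullsweep_after for the fast process
start_fast(Fun) ->
    spawn_opt(Fun, [{fullsweep_after, 0}]).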
gen_server vs barebone process 
• Message passing using gen_server:call() is 2x slower than Pid ! Msg 
• For speedy code prefer barebone processes to gen_servers 
• The OTP Design Principles are about high availability, not high performance
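• A barebone request/reply process, as a sketch; the message protocol is illustrative: 
start() ->
    spawn(fun() -> loop(#{}) end).

loop(State) ->
    receive
        {get, From, Key} ->
            From ! {reply, Key, maps:get(Key, State, undefined)},
            loop(State);
        {put, Key, Value} ->
            loop(State#{Key => Value})
    end.

%% Caller side: Pid ! {get, self(), Key}, then receive {reply, Key, V} -> V end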
NIFs: more pain than gain 
• A new principle of Erlang development: do not use NIFs 
• For a small performance boost, NIFs undermine key properties of 
Erlang: reliability and soft-realtime guarantees 
• Most of the time Erlang code can be made as fast as C 
• Most performance problems in Erlang are traceable to NIFs or to external C libraries, which behave similarly 
• Erlang on Xen does not have NIFs and we do not plan to add them
Fast counters 
• 32-bit or 64-bit unsigned integer counters with overflow - trivial in C, 
not easy in Erlang 
• FIXNUMs are signed 29-bit integers, BIGNUMs consume heap and are 
10-100x slower 
• Use two variables for a counter? 
foo(C1, 16#ffffff, ...) -> foo(C1+1, 0, ...); 
foo(C1, C2, ...) -> foo(C1, C2+1, ...); 
... 
• LING has a new experimental feature – fast counters: 
erlang:new_counter(Bits) -> Ref 
erlang:increment_counter(Ref, Incr) 
erlang:read_counter(Ref) 
erlang:release_counter(Ref)
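• A fuller sketch of the two-variable trick above; step/1 stands in for the real per-iteration work: 
-define(LO_MAX, 16#ffffff).

%% Keep the counter in two arguments (registers): every increment stays
%% within FIXNUM range and allocates nothing on the heap
loop(Hi, ?LO_MAX, State) -> loop(Hi + 1, 0, step(State));
loop(Hi, Lo, State) -> loop(Hi, Lo + 1, step(State)).

%% Reconstruct the full value only when it is reported; this single
%% multiplication may produce a BIGNUM, but only once per read
value(Hi, Lo) -> Hi * (?LO_MAX + 1) + Lo.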
Future: static compiler for Erlang 
• Scalars and algebraic types 
• Structural types only – no nominal types 
• Target compiler efficiency, not static type checking 
• A middle ground between: 
• “Type is a first class citizen” (Haskell) 
• “A single type is good enough” (Python, Erlang)
Future: static compiler for Erlang - 2 
• Challenges: 
• Pattern matching compilation 
• Type inference for recursive types (see the typespec sketch after this slide): 
y = {(unit | y), x, (unit | y)} 
y = nil | {x, y} 
• Work started in 2013 
• Currently the compiler is at the proof-of-concept stage
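• For comparison, the recursive types above written as today's Erlang typespecs (checked by Dialyzer, not by the compiler); this is an analogy, not the proposed type system: 
-type x() :: term().
%% y = {(unit | y), x, (unit | y)}
-type y1() :: {unit | y1(), x(), unit | y1()}.
%% y = nil | {x, y}
-type y2() :: nil | {x(), y2()}.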
Questions? 
e-mail: maxim.kharchenko@gmail.com
