Ever since Intel and AMD started selling multi-core CPUs, the Erlang hype has been growing continuously. Every high-profile project using Erlang was loudly announced across the blogosphere as the coming of the second C.
We kept hearing about RabbitMQ, CouchDB, nearly all of Amazon’s AWS, Heroku’s routing grid, and Facebook chat all switching over to Erlang because it was the fast and concurrent language of the future. No other language could hold a candle to a language run by telecom giant Ericsson and then validated by Amazon and Facebook! Or could it? When I tried to find benchmarks for Erlang, they all showed otherwise.
http://muharem.wordpress.com/2007/07/31/erlang-vs-stackless-python-a-first-benchmark/
Since I initially bought into the hype, I felt compelled to test it myself to see if they made a mistake. To do this, I wrote a simple http server in Erlang, Haskell, and Python that simply outputs an HTTP reply “Pong!”. And here are the results in a graph.

The green line is the maximum req/sec possible. Higher is better.
For more details, continue reading.
Update
See the comments for a faster Haskell implementation that puts it almost on par with Erlang.
Summary
A simple server was written in Python, Haskell, and Erlang. The server accepts any input from clients and outputs
HTTP/1.0 200 OK
Content-Length: 5

Pong!
and then disconnects. The servers were benchmarked with httperf compiled with increased select limit of 65535 connections. In every test, 0 errors occurred.
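As a quick sanity check on that reply, here is a minimal Python 3 sketch that builds it and derives Content-Length from the body instead of hard-coding it. The helper name build_response is made up for illustration and is not part of the benchmarked servers. Note that the listings further down append an extra "\r\n" after "Pong!", which falls outside the five declared bytes.

```python
def build_response(body=b"Pong!"):
    """Build the minimal HTTP/1.0 reply, computing Content-Length from the body."""
    # build_response is a hypothetical helper, not taken from the benchmark code.
    head = b"HTTP/1.0 200 OK\r\nContent-Length: " + str(len(body)).encode("ascii")
    return head + b"\r\n\r\n" + body


response = build_response()
print(response)
```

Deriving the length this way keeps the header and the body from drifting apart if the payload is ever changed.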
Python
- 1st place
- A single process/thread using epoll sustained the most connections per second before hitting a CPU bottleneck.
- This was highly unexpected since we are comparing it with Erlang and Haskell.
Erlang
- 2nd place.
- Having SMP/multicore enabled reduced requests/sec by a factor of 4!
- Enabling kernel polling (epoll) had a negligible performance impact (less than ±1%).
- Someone suggested enabling “active” receive mode which asynchronously puts received packets in the Erlang message queue. This made a negligible difference.
Haskell
- 3rd place
- This might be because it uses select instead of epoll. However, this did not make a difference for Erlang, so I suspect it would not for Haskell as well.
- The program was compiled with -O2 with modest performance gains. Compiling with -threaded for SMP support reduced performance by a factor of 2!
Conclusion
DO NOT WRITE A SERVER IN ERLANG JUST BECAUSE YOU HEARD ERLANG IS THE FASTEST AND MOST CONCURRENT LANGUAGE.
Erlang is not “made for multicore”. Erlang only just received SMP support in 2006!
Setup
2 x 4 core Xeon E5420 @ 2.50GHz
The following steps were done to lift the usual limits that prevent default installations of Linux from being able to hammer servers. Neglecting this step will artificially cap concurrent connections at 1024 while timeout errors will increase. This is an important step that many other benchmarks have left out, and it shows in the error rate. All tests here resulted in 0 errors.
Httperf also depends on select, so the select limit was increased to 65,535 file descriptors.
- edit /etc/security/limits.conf and add the lines "* hard nofile 65535" and "* soft nofile 65535" (reboot if ulimit -n does not change)
- edit /usr/include/bits/typesizes.h and change "#define __FD_SETSIZE 1024" to "#define __FD_SETSIZE 65535"
- compile httperf from source
- Increase kernel file descriptor limit: sudo bash -c "echo 128000 > /proc/sys/fs/file-max"
- Increase the backlog: sudo sysctl -w net.core.netdev_max_backlog=60000
- Increase maximum connection limit: sudo sysctl -w net.core.somaxconn=250000
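To confirm that the limits.conf change actually took effect for the current process, something like the following Python 3 sketch can be used. It only reads the per-process descriptor limit through the standard resource module (Unix only). The helper names and the 40,000 figure (the num-conns value used in this post) are illustrative assumptions.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def can_hold(connections, soft_limit):
    """True if soft_limit leaves room for `connections` simultaneous sockets."""
    if soft_limit == resource.RLIM_INFINITY:  # "unlimited" is reported as a sentinel
        return True
    return soft_limit > connections


soft, hard = nofile_limits()
print("soft=%s hard=%s" % (soft, hard))
print("room for 40000 connections:", can_hold(40000, soft))
```

With the default soft limit of 1024 the second line reports False, which is the artificial cap described above.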
Software
Ubuntu 9.04 x64
Erlang: BEAM 5.6.5
Haskell: GHC 6.10.4
Python: CPython 2.6.4
Httperf: 0.9.0
Httperf was run with the following settings, where port and rate were adjusted accordingly:
httperf --port=8000 --num-conns=40000 --rate=5000
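Since only the port and rate change between runs, the command lines for a sweep can be generated. This is a minimal Python 3 sketch, assuming a sweep from 5,000 to 15,000 in steps of 5,000 (the exact rates behind the graph are not listed in this post). httperf_command is a hypothetical helper, and the flags are spelled with the double hyphens that httperf's long options use.

```python
def httperf_command(port, rate, num_conns=40000):
    """Return the argument list for one httperf run (hypothetical helper)."""
    return [
        "httperf",
        "--port=%d" % port,
        "--num-conns=%d" % num_conns,
        "--rate=%d" % rate,
    ]


# The sweep range below is an assumption made for illustration.
for rate in range(5000, 15001, 5000):
    print(" ".join(httperf_command(8000, rate)))
```

Each list can be handed to subprocess.run as-is on a machine where httperf is installed.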
Erlang active mode
-module(echo).
-export([listen/1]).

-define(TCP_OPTIONS, [binary, {packet, 0}, {active, true}, {reuseaddr, true}, {backlog, 60000}]).

% Call echo:listen(Port) to start the service.
listen(Port) ->
    {ok, LSocket} = gen_tcp:listen(Port, ?TCP_OPTIONS),
    spawn(fun() -> accept(LSocket) end).

% Wait for incoming connections and spawn the reply loop when we get one.
accept(LSocket) ->
    {ok, Socket} = gen_tcp:accept(LSocket),
    Pid = spawn(fun() -> loop(Socket) end),
    gen_tcp:controlling_process(Socket, Pid),
    accept(LSocket).

% Send the fixed reply as soon as any data arrives on Socket, then close it.
loop(Socket) ->
    receive
        {tcp, Socket, Data} ->
            gen_tcp:send(Socket, "HTTP/1.0 200 OK\r\nContent-Length: 5\r\n\r\nPong!\r\n"),
            gen_tcp:close(Socket);
        {error, eaddrinuse} ->
            done
    end.
Haskell
import IO
import Control.Exception hiding (catch)
import Control.Concurrent
import Network
import System.Posix

main = withSocketsDo (installHandler sigPIPE Ignore Nothing >> main')

main' = listenOn (PortNumber 9900) >>= acceptConnections

acceptConnections sock = do
    conn@(h,host,port) <- accept sock
    forkIO $ catch (talk conn `finally` hClose h) (\e -> print e)
    acceptConnections sock

talk conn@(h,_,_) =
    hPutStrLn h "HTTP/1.0 200 OK\r\nContent-Length: 5\r\n\r\nPong!\r\n" >> hFlush h >> hClose h
Python
import select
import socket

EPOLLIN = select.EPOLLIN
EPOLLOUT = select.EPOLLOUT

epoll = select.epoll(60000)
connections = {}  # file descriptor -> Server or Client object

class Server(object):
    def __init__(self):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.setblocking(0)
        sock.bind(('', 8050))
        sock.listen(60000)
        self.socket = sock
        fileno = sock.fileno()
        connections[fileno] = self
        epoll.register(fileno, EPOLLIN)

    def onInput(self):
        # The listening socket is readable: accept one new client.
        sock, address = self.socket.accept()
        Client(sock)

class Client(object):
    input = ''
    output = "HTTP/1.0 200 OK\r\nContent-Length: 5\r\n\r\nPong!\r\n"

    def __init__(self, sock):
        sock.setblocking(0)
        fileno = sock.fileno()
        epoll.register(fileno, EPOLLIN|EPOLLOUT)
        connections[fileno] = self
        self.socket = sock

    def onInput(self):
        newdata = self.socket.recv(1024)
        if len(newdata) == 0:
            self.close()
        self.input += newdata

    def onOutput(self):
        # Send as much of the reply as the socket accepts; close when done.
        sent = self.socket.send(self.output)
        self.output = self.output[sent:]
        if len(self.output) == 0:
            self.close()

    def close(self):
        fileno = self.socket.fileno()
        del connections[fileno]
        epoll.unregister(fileno)
        self.socket.close()

Server()

while 1:
    for fd, event in epoll.poll():
        if event & EPOLLIN:
            connections[fd].onInput()
        if event & EPOLLOUT:
            connections[fd].onOutput()
Source Code
Erlang version 1 (active mode): hello.erl
Erlang version 2 (passive mode): hello2.erl
Judging by the length of the Python example (look how low level it is!!) this kind of benchmark is highly dependent on the quality of the code in each language.
For example, in Haskell you’re doing slow String IO, instead of bytestring IO, and not using the new epoll library: http://www.serpentine.com/blog/2009/12/17/making-ghcs-io-manager-more-scalable/
Here’s a simple improvement to make the Haskell not entirely naive: http://hpaste.org/fastcgi/hpaste.fcgi/view?id=16221#a16221
On my machine, I can get 10.2k conn/sec, while your example only does 6k sec.
The Python version is longer because light weight threading isn’t built into the core VM unlike Haskell and Erlang. I implemented the multiplexing in pure Python which should by all means be a slower language than Haskell or Erlang.
I could have implemented this in asynchat, and it would have hidden away much of the length as you can see here.
I tried to get your example to work, but it said “Could not find module `Network.Socket.ByteString’” even after I did cabal install Network. I also couldn’t find out where to download the epoll library in that link. I welcome any additions or improvements to these examples.
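On the asynchat point above: asynchat has since been removed from the Python standard library (as of 3.12), so the closest present-day analogue is probably asyncio. The following is a rough Python 3 sketch of the same "Pong!" server on top of asyncio streams, just to show how much of the multiplexing code a framework hides. It is not the benchmarked code, it was not measured, and the names handle and fetch_once are illustrative.

```python
import asyncio

RESPONSE = b"HTTP/1.0 200 OK\r\nContent-Length: 5\r\n\r\nPong!"


async def handle(reader, writer):
    await reader.read(1024)   # accept any input, like the benchmark servers
    writer.write(RESPONSE)
    await writer.drain()
    writer.close()            # disconnect after the reply


async def fetch_once():
    """Start the server on a free local port, make one request, return the reply."""
    server = await asyncio.start_server(handle, "127.0.0.1", 0)
    port = server.sockets[0].getsockname()[1]
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    writer.write(b"GET / HTTP/1.0\r\n\r\n")
    await writer.drain()
    reply = await reader.read()  # read until the server closes the connection
    writer.close()
    await writer.wait_closed()
    server.close()
    return reply


reply = asyncio.run(fetch_once())
print(reply)
```

The per-connection logic shrinks to the four lines in handle; the event loop that replaces the explicit epoll loop lives inside asyncio.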
network-bytestring is the package you’re looking for (you can find’em using Hoogle or Hayoo)
Thanks it started working, but I ran into another problem.
hello2.hs:18:25:
    Couldn't match expected type `ByteString'
           against inferred type `[Char]'
    In the second argument of `sendAll', namely `msg'
    In a stmt of a 'do' expression: sendAll c msg
    In the expression:
        do sendAll c msg
           sClose c
Edit
I found the missing code needed to get it to work:
import qualified Data.ByteString.Char8 as B
msg = B.pack "HTTP/1.0 200 OK\r\nContent-Length: 5\r\n\r\nPong!\r\n"
Benching results with -O2 -threaded:
rate req/s
9000 8998.9
10000 9806.4
11000 7460.2
Will you update the graphs to use the bytestring version?
@Don I’m not sure it would be fair to Erlang to do so.
First of all, this is a package that is not bundled with the standard distribution and is even labeled experimental
And second, you have to convert all your strings by packing it which is an O(n) operation.
That’s why I used {-# LANGUAGE OverloadedStrings #-}, which is the idiomatic way to use bytestring literals. Did you remove that from your version? This is idiomatic Haskell, so I don’t think it is fair not to use it.
BTW, bytestring isn’t “experimental” in that sense. It is in the Haskell Platform: http://hackage.haskell.org/platform/contents.html which is the only standard Haskell distribution.
@Don, if I am reading the manual correctly, all that Overloaded string does is to make the conversion from string to bytestring implicit. Behind the scenes, it still has to convert with an O(n) cost.
And while bytestring is included in that distro, the package network-bytestring does not appear to be there.
Indeed, there is a conversion, but it happens at compile time.
It’s fine if you’re not going to redo the benchmarks, but it’s a bit silly to be deciding what is and isn’t idiomatic Haskell. Using String for a high performance server isn’t idiomatic, for example, so the result is meaningless.
Here’s a simple example in Haskell using the new epoll scalable IO library, available at http://github.com/tibbe/event
The code is here, and is little changed from the naive code: http://hpaste.org/fastcgi/hpaste.fcgi/view?id=16253#a16253
While your original example reaches 6k conn/sec on my measurements, the epoll-based version reached 15.1k conn/sec.
I’ve summarised the three Haskell versions: String-based concurrent, bytestring concurrent and epoll bytestring on a wiki page here, along with corresponding measurements: http://haskell.org/haskellwiki/Simple_Servers
I think the conclusion is that a bit of custom epoll code pushes the work into the kernel, and you’ll get much the same performance in *any* language when the program is not compute bound.
I’d be interested to see if you get similar results to mine for the Haskell epoll version.
On my machine, your Python epoll version achieves 10.1k req/sec, while the Haskell epoll version above achieves 15k req/sec.
I just tested my Python again, and it is still bottlenecks at > 12k/sec. I noticed that you have num-conns at 10,000. This number is a bit low and prone to measurement errors especially when doing a rate of 10,000+. I used num-conns 40,000.
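The arithmetic behind that concern is simple: httperf opens connections at a fixed rate, so the load phase lasts roughly num-conns divided by rate. A small Python 3 sketch (run_seconds is an illustrative name):

```python
def run_seconds(num_conns, rate):
    """Approximate length of the load phase: connections divided by connections/sec."""
    return float(num_conns) / rate


for num_conns in (10000, 40000):
    print(num_conns, "connections at 10000/sec:", run_seconds(num_conns, 10000), "s")
```

At a rate of 10,000 per second, 10,000 connections give about one second of load, while 40,000 give about four, so startup effects weigh less on the average.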
I get the same results for Haskell epoll and Python epoll with --num-conns=40000:
* Haskell epoll, Request rate: 14683.7 req/s
* Python epoll, Request rate: 10103.3 req/s
Linux 2.6.31-ARCH x86_64, ghc 6.10.4
Did you do the kernel tweaks?
So you’re basically comparing non-preemptive select vs. multiple preemptive processes, not python and erlang/haskell runtimes. Very funny, indeed.
Now try this: for every input, you should talk to a database. Or compose a heavy webpage. Or you have a pretty slow client. Your python solution would block everyone while it processes data for one client (e.g. waits for db), while solutions in haskell and erlang would not. That’s the whole point. In python/ruby/… the real solution would consist of something like multiple single-threaded interpreter processes balanced by an external server. Which is far more ugly and slow.
For comparison to be fair, you should spawn a preemptive thread (os one in python?) for each client or write your own scheduler for python (i doubt it’s possible)
Congratulations for being the first person to blatantly skim through the details. I knew it wouldn’t take very long. I tested SMP performance, which should have blown away a single thread out of the water on a 8 core system.
Since it decreased performance, I decided to turn off SMP for both Haskell and Erlang. Even after doing this, they are still not as fast as Python which is considered to be a slower language.
“Your python solution would block everyone while it processes data for one client (e.g. waits for db), while solutions in haskell and erlang would not.”
That is incorrect. You can use a thread in Python to call C modules which may release the global lock. Erlang and Haskell are not devoid of blocking function calls as well. It depends on the implementation.
> I tested SMP performance, which should have blown away a single thread out of the water on a 8 core system.
This is a common misconception, SMP does not always help. Especially when doing various no-ops.
> Since it decreased performance, I decided to turn off SMP for both Haskell and Erlang.
Turning off SMP does not turn off preemption in haskell and erlang. Their runtime performs preemption without any OS help (python can’t do that)
> That is incorrect. You can use a thread in Python to call C modules which may release the global lock.
“You can use a thread” — of course, you mean OS thread? Well, try to do something meaningful with 10000+ of these.
> Erlang and Haskell are not devoid of blocking function calls as well.
They are not, but i’m not talking about blocking system/function calls there. In your python example you block with every computation, might it use any syscalls or not, but in haskell i’m sure you could compute something like pi digits in non-blocking way.
Your comparison is incorrect because haskell is preemptive here (without any use of OS threads! — it is OS threads that are called ‘SMP’ in haskell) and python is not.
You are severely overestimating the overhead of a context switch for the lightweight threads used for Haskell and Erlang.
The cost of preempting every X opcodes in Haskell and in Erlang is negligible. It is only when you use real OS threads and processes that context switching carries a large cost.
If you want to compare this with Python’s implementation of stackless Python, the context switching overhead is merely the cost of 1 Python function call. This is not enough to cover the discrepancy in speed outlined in the graph here.
> You are severely overestimating the overhead of a context switch for the lightweight threads used for Haskell and Erlang.
You are severely overestimating your (and everyone’s) knowledge of what’s happening there. For example, your cpu might lose its cache while preempting etc. When doing no-ops, everything is very slow in comparison to no-ops. When doing real work it is not, so you should do some benchmark with real work first and _then_ say that “preemption is negligible in this benchmark”.
@k.pierre.k
“You are severely overestimating your (and everyone’s) knowledge of what’s happening there. When doing no-ops, everything is very slow in comparison to no-ops. ”
I seriously doubt you have a clue what you are talking about. The cache loss in a green thread should be equivalent to implementing your own multiplexer as I have in Python. This is common knowledge to anyone who knows something about operating systems and has seen the benchmarks of Stackless Python.
Right now it sounds like you simply read about preemption on Wikipedia and are trying to apply it to every situation that has that word.
I assume that by “hype”, you mean the number of blog posts written by people who just picked up the language or read stuff about it on the web? You certainly cannot accuse Ericsson of hyping Erlang as a product, as they do not market it at all.
Haskell is arguably the most powerful programming language on earth. Speed has never been a priority, yet it is surprisingly fast. Erlang was designed for writing complex telecom systems with near-zero downtime. Speed was not a priority, as long as it is “fast enough”.
When people say that Erlang was “designed for multicore”, it is obviously strictly untrue, as Erlang pre-dates multicore by almost two decades. There was, however, a working SMP prototype for Erlang in 1997, which showed how Erlang fits naturally with SMP. It was not made into a product, since the commercial systems using Erlang were embedded systems with neither sufficient space nor power budget to house the SMP boards available at the time.
Erlang has never done that well in microbenchmarks. The advice has always been to consider your total requirements and try to find experience reports from people who’ve written real products with similar requirements, or write representative prototypes and measure.
In your particular benchmark, I imagine that the dramatic drop is due to SYN flood protection, and the python program fares best because it’s picking off connections in a tight loop. Also, the fact that enabling multicore reduces performance for both Erlang and Haskell suggests that what your benchmark is measuring is the ability to peel connections off the socket(s) as fast as possible, in which case disabling SMP and running a tight loop with blocking semantics would be the fastest option by far.
“In your particular benchmark, I imagine that the dramatic drop is due to SYN flood protection”
My machine has no iptables settings that limit this. The kernel backlog limit was enlarged to 60,000 as I explained in detail, which means that it is impossible for this to be due to disconnection of half-open sockets when you keep in mind that the total connections is 40,000. I know this is a common flaw for many benchmarks online which is why I specifically mentioned it above.
HTTPERF also indicated 0 errors for all tests. The dropoff you are seeing is due to the saturation in CPU usage, not any connection problems.
“running a tight loop with blocking semantics would be the fastest option by far.”
I did bench Erlang and Haskell with SMP disabled, and the Python loop is not blocking at all. I am quite certain that Erlang and Haskell also utilize nonblocking sockets since forkIO does not dedicate an OS-thread, and Erlang’s active mode asynchronously receives messages in a queue.
“Erlang has never done that well in microbenchmarks.”
The problem here is that if Erlang should be fast at anything, it should be fast at being a server and handing out bytes. After all, that is what it is made for, isn’t it?
I agree with you that the code-swapping is good, but is it really necessary these days when distributed systems are spread through multiple machines that can be turned off at will? A couple of decades ago, a company like Ericsson had a handful of computers. Now companies have tens of thousands in a single datacenter.
The normal backlog setting does not affect the SYN flood protection in the IP stack (as far as I’m aware, this is not an iptables issue).
See e.g. http://www.erlang.org/cgi-bin/ezmlm-cgi/4/43624 for some more detail on this. The drop-off in your graphs indicates that there is more going on than simply CPU saturation, which by itself doesn’t tend to give such dramatic performance degradation.
And no, this is not what Erlang was made for. Erlang was made for control system logic, where you do have lots of communication over sockets, but most importantly, there is coordination between lots of different “actors” – resource reservation, bandwidth regulation, billing data reporting, etc.
Most of the time, these systems are not “central office applications”, but spread out all over the place, and with fairly stringent footprint and power consumption requirements.
And what matters most in these systems is not that they are as fast as can be, but that they behave robustly and are easy to evolve and maintain. They need to be “fast enough”, and Erlang has proven to provide sufficient performance to satisfy this requirement. From what I’ve seen of Haskell and Python benchmarks, so would they. The question then is if they meet the other requirements. The answer will vary from domain to domain, just as it does for Erlang.
One of the things that tends to complicate matters a lot is when you start thinking about fault tolerance, and “recovery units”. In Erlang, for a wide range of errors, the only affected part of the system is the session where the error occurred. For a fairly ambitious study of Erlang’s performance in more realistic setting, read e.g. http://www.macs.hw.ac.uk/~dsg/telecoms/publications/erlang03.pdf
Quote from your link:
“My experiments with this is that there is a very close relation between
the backlog and how many connections / second you can handle.”
I increased the backlog limit to 60,000 as I mentioned above, so there is no syn flood protection dropping connections. And even if there was, httperf would have detected it as a timeout error.
It’s funny how you keep misrepresenting what Erlang is supposed to be good at, and then ignoring the rebuttals which clarify things, both in this thread, your article, and throughout the comments section.
@Michael
Why don’t you elaborate then? I have responded to every supposed flaw, while the detractors here keep ignoring them and repeating the same thing over and over again.
Here is a list of nonsensical arguments that I have debunked that you all keep repeating:
- Preemption at the VM level of Erlang and Haskell should be as expensive as a true OS thread switch:
Wrong, it should only cost maybe 2-3 C level function calls, with most of the registers remaining unchanged. When you compare this to an implementation in Python, there is absolutely no reason for these languages to be slower.
-This benchmark doesn’t do anything like query databases and do expensive computations:
The point of this benchmark was to test the language implementation of multiplexing. What are you supposed to do in a benchmark? You remove variables, you don’t add them.
- The graph is inaccurate because of syn protection etc…:
Wrong. I’ve tweaked the kernel to support more connections than I test with.
- The Python implementation is stateless or it isn’t a scheduler:
Wrong, the Python implementation is stateful. Every socket is attached to a Python object, and you can attach arbitrary state to it. And the accusations about not being a “real” scheduler? How do you think a scheduler is implemented? Hint: it looks like the Python script.
- Erlang wasn’t made for fast servers:
Ok, I’ll take your word for it. But it still doesn’t change the public perception. And if it’s not a suitable language for servers, then what is it good for?
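To make the "how do you think a scheduler is implemented" point concrete, here is a minimal round-robin scheduler in Python 3 built on generators. It is a cooperative toy, not the epoll loop from the post and not how BEAM or GHC implement preemption; the names client and run are made up for illustration.

```python
from collections import deque


def client(name, steps, log):
    """A toy task: record one unit of work per step, then yield to the scheduler."""
    for i in range(steps):
        log.append((name, i))
        yield  # hand control back to the scheduler


def run(tasks):
    """Round-robin: resume each task in turn until all of them have finished."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)
        except StopIteration:
            continue           # task finished, drop it
        ready.append(task)     # otherwise put it at the back of the queue


log = []
run([client("a", 2, log), client("b", 3, log)])
print(log)
```

In the epoll version the ready queue is effectively supplied by the kernel: epoll.poll() returns the descriptors that can make progress, and the loop dispatches to each one in turn.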
Since your benchmark does no significant computation a “slow language” versus a “fast language” makes very little difference. You are benchmarking some limited aspects of the runtime system. That’s not a bad thing, but it’s good to know where the time is spent.
I gave this some thought and here is my hypothesis:
It is clear that the Python example is different from the examples of Haskell and Erlang. In the python example, we run rounds off of epoll() whereas the other system use a light-weight userland scheduler. What happens in the python code is that in each epoll-round, we will get any new connections accepted due to the onInput() dispatch in the server. Haskell and Erlang will only do this whenever the server process is scheduled to run on a core. The chance that the server process is run will dwindle when we have many processes, in particular when we accept a lot of processes and can’t complete all of them in their time slot. In contrast, no such thing will happen in the Python process. It will clog when the acceptor queue fills up and that happens when it can’t complete epoll rounds fast enough.
This hypothesis can be tested by measurement of the number of active processes in the erlang and haskell runtimes compared to when the server process runs. Haskell and Erlang die due to the scheduler.
A knob worth trying to play with is the Erlang +A option. It controls async threads for IO use. Even a low setting like 32 should experience a speedup if there is anything to win here.
Another idea is to run the scheduler in Erlang and Haskell in a transposed fashion like epoll() does in Python. That is, we spawn a fairly small amount of processes. Each process basically runs a forever loop where it in turn: 1) accepts, 2) communicates. It may not help much though.
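The shape of that "small fixed pool, each worker accepts and then communicates" idea can be sketched as follows. This is Python 3 with OS threads standing in for Erlang processes or Haskell threads, so it only illustrates the structure and says nothing about how either runtime would perform; the pool size of 4 and all names are assumptions.

```python
import socket
import threading

RESPONSE = b"HTTP/1.0 200 OK\r\nContent-Length: 5\r\n\r\nPong!"


def worker(listener):
    """Forever loop: 1) accept a connection, 2) talk to it, then repeat."""
    while True:
        conn, _addr = listener.accept()
        with conn:
            conn.recv(1024)
            conn.sendall(RESPONSE)


def start_pool(workers=4):
    """Open a listening socket on a free local port and start the worker pool."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("127.0.0.1", 0))
    listener.listen(128)
    for _ in range(workers):
        threading.Thread(target=worker, args=(listener,), daemon=True).start()
    return listener


def fetch(port):
    """Minimal client: send a request and read until the server disconnects."""
    with socket.create_connection(("127.0.0.1", port)) as sock:
        sock.sendall(b"GET / HTTP/1.0\r\n\r\n")
        chunks = []
        while True:
            data = sock.recv(1024)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


listener = start_pool()
print(fetch(listener.getsockname()[1]))
```

Unlike the process-per-connection listing, nothing is spawned per request here; the number of schedulable units stays fixed regardless of load.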
I don’t think the +A option will make any difference in this case. The inet driver doesn’t make use of the asynch thread pool (the driver has to do this explicitly, and e.g. the efile driver does).
Personally, I think it’s pretty impressive that Erlang is able to keep up as far as to 10K requests/sec, considering that for each incoming, it (1) spawns a process, (2) hands over control of the socket to the new process, (3) schedules it for execution, (4) the new process sends a reply when it gets a timeslice, and (5) the process is terminated and memory recovered.
…all within 100 usecs of effective cpu time.
For this tiny benchmark, Erlang’s acrobatics are clearly overkill, and there really is no way in Erlang to cut down on the overhead much. This is due to the assumption that, in all meaningful applications where Erlang is a suitable choice, this is the bare minimum of what you need.
“For this tiny benchmark, Erlang’s acrobatics are clearly overkill, and there really is no way in Erlang to cut down on the overhead much.”
Erlang “processes” are supposed to be lightweight. It is not a true OS thread switch. Handing over control of the socket to a new process is merely writing a (process, fileno) tuple into a hash table or tree. Scheduling of execution should then put the “pid” at the end of the scheduler’s linked list if it needs more cpu or put into the list for select or epoll to poll. None of this should be cpu intensive.
I am effectively doing the same thing that Erlang does in this Python script by creating a new Python object which costs at least 280 + 64 bytes not even including the member variables every time a client comes in. The socket is handed over by saving it as a member variable and then saving it to the global fd->client hash table where it is scheduled to be checked on upon connection. When disconnected, the client object removes itself from the global mapping, and is appropriately garbage collected.
If Erlang and Haskell run round robin schedulers, then it should have the same logic as the Python script.
There is 1 acceptor “thread” and X client “threads”. With each call of epoll (which is essentially the python scheduler here), one accept is made per loop, and in each loop the number of client “threads” accumulates.
If Erlang/Haskell is dedicating all the cpu to accepting connections instead of serving clients, this is indicative of a bad scheduler. A good scheduler would round robin through the tasks.
You are exactly at where I am at: In Python, you are essentially measuring how fast epoll is. In Haskell and Erlang you are measuring how fast select/epoll is with the added work of running the scheduler on top of that. It should come as no surprise that Python is faster in this case. “Green” threads may be cheap, but they are certainly not free.
One thing makes me wonder however: why are the Erlang and Haskell programs not receiving up to 1024 bytes as is the case for Python?
So this really is “debunking user land schedulers versus custom epoll” for a benchmark
The problem here is that if Erlang and Haskell were sanely implemented, they should be blocking on epoll or select when the threads were waiting on IO.
Otherwise, what you are suggesting is that Erlang and Haskell are wasting cycles on busy loops while blocking on IO.
And no, I am not simply measuring how fast epoll is. I am using epoll to trigger my own scheduler written in pure Python which should by all means be slower than Erlang or Haskell.
It is an interesting result: a custom user land scheduler in Python can beat general purpose preemptive threads in Erlang or Haskell, for this particular benchmark.
And in general, I would be fairly confident custom schedulers in each language will beat general purpose schedulers in each language.
I’d still love to see a custom scheduler in Erlang and Haskell as well, alongside, e.g. using the Haskell epoll user events library, http://github.com/tibbe/event/blob/master/docs/design.md
>The problem here is that if Erlang and Haskell were sanely implemented
You should at least try to consider the possibility that there is some sanity in the designs of the Haskell and Erlang runtimes, and that your small benchmarks may not hold the entire truth about performance in concurrency-related applications.
Having been fairly closely involved with the last 13 years of tuning Erlang for massively concurrent commercial systems, I can safely say that your benchmark ignores several things that are real issues in products of any significant complexity.
You choose not to believe this. Fine. You are not the first to be supremely confident that your small prototype beats mature frameworks, and that the things you’ve omitted will have no effect on the outcome, once added. Verifying your assumptions in a full-scale project will be a good learning experience, if nothing else.
Depending on the problem you’re trying to solve, your approach may indeed give better performance (and be sufficiently powerful). In order to draw more general conclusions, you need to do more work.
You’re surprised a single thread on a single core outperformed SMP and multicore implementations for an IO-bound problem? This lesson is taught in every undergraduate operating systems, concurrency and networking course.
Modify your examples to handle hundreds of thousands of processes, all seamlessly distributed over multiple nodes, gracefully handling errors between nodes… that is trivial with Erlang, and will continue to perform very well. How much code is it going to take to add those capabilities to your super-fast single-threaded utterly not fault-tolerant Python program? Lots. How fast is it going to be once you start throwing in locks to work around shared memory concurrency? Not… very.
The conclusion of this web server shootout is about as useful as racing a jetski against a tugboat, declaring the jetskis to be faster since the jetski won the race, and then deciding to pull barges into the harbour with jetskis from now on since they’re so much faster.
Please actually read the entire article. I disabled SMP support.
And no, it is not an IO bound problem. The bottleneck was the CPU while IO measured at below 1 megabyte per second.
The fact you disabled SMP support has nothing to do with it. I did read the article… several times. It starts:
“Ever since Intel and AMD have been selling multi-core cpus, the Erlang hype has been growing continuously.”
And then you “debunk” the concurrency Erlang hype by showing us a python program that runs on a SINGLE core and does not use ANY concurrency constructs like threads, processes, fibres, coroutines, (etc), WHATSOEVER.
Let me state that again, because it really is stunning… you assert Erlang’s hype for multi-core support is undeserved because a *non*-concurrent, single-core Python program ran something a little bit faster than a *concurrent* Erlang program running on a single core. *claps*
As I said, you’ve demonstrated what every undergrad in CS learns, that a single thread of execution will perform better on a single core than multiple threads are. This is well understood!
The Erlang program you posted spawns a new process for every accepted connection. Yes, Erlang processes are very lightweight, but they are not a NOOP. The Python program does not spawn a new process, or thread, or any other concurrency construct. The Python and Erlang programs consume bytes over the network at the same rate. The difference in performance is found with what the programs are doing during and between the connections. The Python program does nothing, while the Erlang program is pre-emptively scheduling thousands of processes to handle the incoming connections.
Again, there is no surprise in your results, and it certainly does nothing to dampen the enthusiasm for a language and runtime built for concurrency, distribution and fault-tolerance running on distributed clusters of multi-core systems.
“And then you “debunk” the concurrency Erlang hype by showing us a python program that runs on a SINGLE core and does not use ANY concurrency constructs like threads, processes, fibres, coroutines, (etc), WHATSOEVER.”
If you don’t think a webserver shows any concurrency whatsoever, I don’t know what to tell you. Just because you can see the scheduler in Python doesn’t mean that it is magically non-concurrent.
“that a single thread of execution will perform better on a single core than multiple threads are. This is well understood!”
And you are simply incorrect when the bottleneck is the cpu. I just added a fork on the python version and it increased req/sec by 1,000. So what do you say to that?
“while the Erlang program is pre-emptively scheduling thousands of processes to handle the incoming connections.”
You need to brush up on your Erlang skills because that is not what my Erlang script does.
In your Erlang accept/1: “Pid = spawn(fun() -> loop(Socket) end)”
accept/1 spawns a process for each connection. BEAM schedules the execution of the processes running for each connection.
@Jim
Erlang is synchronously creating a worker for each client as they come in. It is not by any means pre-emptively creating these threads before the clients come in, and therefore no busy waiting is done by Erlang.
I’m not sure what part of spawn creating new Erlang processes in accept you aren’t willing to recognize and admit here. I’m not suggesting busy waiting occurs, I’m suggesting BEAM context switches between the processes and that takes cycles.
The Erlang program uses its VM’s concurrency mechanism; the Python program uses only epoll and thus only the OS’ asynchronous I/O mechanism. No Python language or VM features for concurrency are at play, only kernel-level event handling. This is not an apples-to-apples comparison.
Haskell and Erlang also use “kernel-level event handling”. They use select and epoll.
Do you see the multiplexing part that is calling methods on the client objects? This is concurrency whether you think so or not.
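For readers following the argument, here is a minimal sketch of the kind of multiplexing loop being discussed. It is not the benchmark's source: it uses Python's standard `selectors` module (which picks epoll on Linux) rather than calling epoll directly, and the names `Client`, `on_input`, `on_output`, `step`, and `make_server` are chosen here for illustration. Per-connection state lives in the object, and the loop resumes whichever client the kernel reports as ready.

```python
import selectors
import socket

RESPONSE = b"HTTP/1.0 200 OK\r\nContent-Length: 5\r\n\r\nPong!"


class Client:
    """Per-connection state kept in instance variables (hypothetical class)."""

    def __init__(self, sock, selector):
        self.sock = sock
        self.selector = selector
        self.outbuf = b""

    def on_input(self):
        # Any request bytes trigger the canned reply.
        data = self.sock.recv(4096)
        if not data:
            self.close()
            return
        self.outbuf = RESPONSE
        # Switch interest from "readable" to "writable" for this socket.
        self.selector.modify(self.sock, selectors.EVENT_WRITE, self)

    def on_output(self):
        sent = self.sock.send(self.outbuf)
        self.outbuf = self.outbuf[sent:]
        if not self.outbuf:
            self.close()  # reply fully sent, disconnect

    def close(self):
        self.selector.unregister(self.sock)
        self.sock.close()


def make_server(host="127.0.0.1", port=0):
    """Create a non-blocking listening socket registered with a selector."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(128)
    listener.setblocking(False)
    selector = selectors.DefaultSelector()
    selector.register(listener, selectors.EVENT_READ, None)
    return selector, listener


def step(selector, listener, timeout=1.0):
    """One pass of the event loop: accept new clients or resume ready ones."""
    for key, events in selector.select(timeout):
        if key.fileobj is listener:
            conn, _ = listener.accept()
            conn.setblocking(False)
            selector.register(conn, selectors.EVENT_READ, Client(conn, selector))
        elif events & selectors.EVENT_READ:
            key.data.on_input()
        elif events & selectors.EVENT_WRITE:
            key.data.on_output()
```

Calling `step` in a `while True:` loop gives a single-threaded server. Whether this counts as "concurrency" in the sense of Erlang processes is exactly what the thread disputes; the sketch only shows what the dispatch looks like.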
It is vastly less resource-intensive (by design) to do epoll-based multiplexing than pre-emptive scheduling of hundreds or thousands of concurrent Erlang processes each doing IO.
Your comparison of epoll vs. select(or epoll (or whatever))+spawn is not even close to reasonable, and it absolutely has NOTHING to do with Erlang’s suitability in a multi-core environment, because your argument is based on Python code that does not utilize multiple cores, never mind multiple nodes, which Erlang is designed to support as transparently as it does a single core on a single node.
Write a Python server that, for each connection, spawns a thread or whatever lighter-weight alternative you prefer to be competitive with Erlang’s lightweight processes, ensure those processes run evenly on all (or one less to avoid the ‘last core parallel slowdown’ problem on Linux) available cores, and then you can start comparing Python to Erlang and evaluating Erlang’s suitability for multi-core development.
You do everyone a disservice by suggesting multi-core performance can be characterized by these trivial programs.
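As a rough illustration of the "one concurrency unit per connection" design requested above, here is a sketch using the standard library's `socketserver.ThreadingTCPServer`, which starts an OS thread for each accepted connection. The names `PongHandler`, `ThreadedPongServer`, and `start_server` are made up for this example, and it was not part of the benchmark. Note the limits: OS threads are much heavier than Erlang processes, and CPython's global interpreter lock means these threads do not execute Python bytecode on several cores at once, so this does not satisfy the "run evenly on all cores" part of the request.

```python
import socketserver
import threading

RESPONSE = b"HTTP/1.0 200 OK\r\nContent-Length: 5\r\n\r\nPong!"


class PongHandler(socketserver.BaseRequestHandler):
    """Runs in its own thread for each connection (hypothetical handler)."""

    def handle(self):
        self.request.recv(4096)          # read and ignore the request
        self.request.sendall(RESPONSE)   # send the canned reply
        # socketserver closes the connection after handle() returns


class ThreadedPongServer(socketserver.ThreadingTCPServer):
    allow_reuse_address = True
    daemon_threads = True  # do not block interpreter exit on open handlers


def start_server(host="127.0.0.1", port=0):
    """Start the accept loop in a background thread and return the server."""
    server = ThreadedPongServer((host, port), PongHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Benchmarking this against the epoll version on the same machine would show the per-connection spawn cost for Python specifically; it would not say anything direct about BEAM's scheduler.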
Last time I checked, serving a web page per client belongs in the class of embarrassingly parallel problems.
The Python server already has lightweight “threads of execution”. It is called the onInput and onOutput method. It would probably be helpful for you to imagine these methods unified as resumeThread. This is essentially equivalent to what is happening under the hood of Erlang and Haskell.
And speaking of multicore development, you still haven’t addressed why if I put a single fork statement in the python script, effectively making it multithreaded, it adds 1000 req/sec.
I am done arguing with a wall here. I would suggest that you try to implement your own user-mode scheduler as it is clear you do not understand how they work.
You keep suggesting they’re equivalent, but they aren’t. Erlang processes are general purpose, select/epoll are not.
I can’t address the fork version of your Python script without seeing it!
Your assumption I do not understand what is happening is invalid. I have, for example, implemented a coroutine library in C, written a web server using that library, and evaluated its performance. So I’ve done these shootouts before, I know the results. They’re not a surprise. No, mine are not published, but why would they be, these results are in every textbook. They are not interesting.
Like I said, if you want to evaluate the hype around Erlang’s multi-core support, you have to dig a lot deeper than a hard-coded 5-byte HTTP response, and when comparing performance of a Python script to an Erlang program that uses thousands of Erlang processes, the Python script should also be using a general purpose concurrency mechanism, not something optimized for a single purpose like select/epoll and furthermore executing only at the kernel level and not at the language/VM level.
Since you refuse to address the obvious holes in your comparison, there is no hope for a useful conclusion here.
Good day, and good luck.
The source code is already out there for you to try. Since you claim to know Python, there is no reason for you to not try it yourself.
I don’t doubt that adding a scheduling mechanism adds overhead.
The real question, which you keep avoiding, is this: does it really add over 20% overhead on top of Python being a slower dynamic language? The shootout benchmarks show that Python is expected to be 3-30x slower.
Since you claim to have implemented coroutines and schedulers yourself, then you should already know that the context switch overhead is very small and equivalent to a function call.
Only a few registers are changed, and the instruction register is not loaded from an unusual point where the branch predictor fails. The cost of a few mov opcodes is negligible compared to the rest of the program. How you refuse to understand this is simply baffling.
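To make the "user-mode scheduler" idea concrete, here is a toy round-robin scheduler built on Python generators. It is only a sketch under stated assumptions: `worker` and `run_round_robin` are names invented here, the scheduling is cooperative (each task must `yield` voluntarily, unlike BEAM's pre-emptive scheduling), and it measures nothing about real context-switch cost. It just shows that a "switch" at this level is a resume of saved state rather than an OS-level thread switch.

```python
from collections import deque


def worker(name, steps, log):
    """A lightweight task: each yield hands control back to the scheduler."""
    for i in range(steps):
        log.append((name, i))
        yield


def run_round_robin(tasks):
    """Resume each task in turn until all have finished.

    Returns the number of times a task was resumed and yielded again.
    """
    ready = deque(tasks)
    switches = 0
    while ready:
        task = ready.popleft()
        try:
            next(task)           # resume the task where it left off
        except StopIteration:
            continue             # task finished; drop it from the queue
        switches += 1
        ready.append(task)       # still alive; put it at the back
    return switches
```

Timing `run_round_robin` with a large number of short tasks (for example with `timeit`) would be one way to put a number on the per-switch overhead in CPython, instead of arguing about it.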
If python does it for you, then I guess you should continue to use it.
If you claim to “debunk” then you need to cover the angles. Your arguments are totally unconvincing.
/s
I certainly am partial to Erlang, but I’m curious whether there is any gain in performance from using the newest version of Erlang (R13B03, erts-5.7.4).
The implementation you use (R12B5, erts-5.6.5) pre-dates big improvements made to the SMP system in R13B. Using a new version might boost performance. You might also want to try compiling with HiPE enabled, which might or might not make the code faster (you really have to test it to know).
See the SMP details and why it might make a difference here: http://www.erlang.org/doc/highlights.html
Hello MononcQc, thank you for the suggestion.
I benchmarked with R13B03 (erts-5.7.4), compiled to native code with HiPE, and it made the SMP problem go away. Now it is on par with SMP-disabled Erlang.
Glad to know it made it better. I’d be interested to see a follow-up article with graphs showing the relative speedups of each of your programs when adding/removing cores, if you’re ever interested in doing that.
You say funny things.
The Haskell program as it stands won’t scale up on a multicore because it only has a single accept loop, and the subtasks are too small. The cost of migrating a thread for load-balancing is too high compared to the cost of completing the request, so it’s impossible to get a speedup this way. If you create one accept loop per CPU then in principle it ought to scale, but in practice it won’t at the moment because there is only one IO manager thread calling select(). Hopefully this will be fixed as part of the ongoing epoll() work that was mentioned earlier.
Regarding the slowdown you see with -threaded, this is most likely because you’re running the accept loop in the main thread. The main thread is special – it is a “bound thread”, which means it is effectively a fully-fledged OS thread rather than a lightweight thread, and hence communication with the main thread is very expensive. Fork a subthread for the accept loop, and you should see a speedup with -threaded.
More background on a similar benchmark in this ticket: http://hackage.haskell.org/trac/ghc/ticket/3758
This would be impressive if you actually did something with the client connections.
Unfortunately for you, since Erlang and Haskell compile to machine code, the more computation you do for each client, the better they’re going to fare versus Python.
How much is enough to be considered “something”?
I guarantee you that as long as the benchmarks show Python performing faster than Haskell and Erlang, people like you will still complain that the benchmark isn’t doing “something”.
Erlang doesn’t necessarily compile down to machine code. Most of what’s going on in this benchmark is exercising the runtime system and the inet driver.
A big deal is made out of the fact that the python program hits the ceiling at 20% higher load than Erlang. Yet the python program schedules completely on socket events and is completely stateless, whereas the Erlang processes have their own thread of control and are able to maintain state through context switches as well as block individually and selectively await certain signals. While one can discuss back and forth how much overhead it should add to accommodate that kind of functionality, the only proper way to find out is to perform the relevant benchmarks.
If all you need to do is switch on incoming packets and dispatch stateless callbacks, this benchmark shows that you can easily write a program in Python that outperforms an Erlang program. This should be encouraging news to python programmers who have exactly this problem.
Others can conclude from the figures that Erlang and Haskell do pretty well too, even though this is more low-level than the problems they were intended for. Ultimately, all three languages demonstrate that they can be pretty darn fast, even though they all have a reputation for being slow.
“Yet the python program schedules completely on socket events and is completely stateless”
That’s incorrect. The state of each socket is encoded in a Python object, which is at least 280 + 64 bytes. You can add any state you like inside this object with the appropriate instance variables or overridden methods.
The Python objects are stateless, because they don’t carry around message queues, run-time-stacks, and so forth.
And all of this still runs on one machine, so it really doesn’t show anything about how well a distributed system in each language is going to run – and “well” can mean a lot of different things depending on what one wants to achieve.
An object by its very nature is stateful. It already contains state for input and output buffers along with the socket.
Just because you stash data in a member variable as opposed to a C stack does not make it “stateless”. This is like arguing “a stack allocated in the heap is not really a stack”.
And about scaling to multiple computers, that is not the objective of this benchmark. Scaling to multiple cores however, was shown to have slowdowns in Erlang and Haskell.
When people talk of state in this context, they aren’t just talking about allocating memory, which all of the programs are doing of course. It’s obvious that more is going on, right? An Erlang process, an active object, is not the same as a passive object although they both take up memory. I can’t speak for the Haskell example.
As far as I can see for the Python program, you have exercised its ability to use the epoll of the OS, and it does so rather well, but that has nothing to do with scaling to multiple cores. That has to do with how well the OS does polls meaningfully in accordance with the specification of epoll – and from Python, and how well the hardware can do that.
It would be really interesting to see an example in Twisted (Python framework) with some kind of scheduling above to compare it to Erlang – or the other way around to see an Erlang example of a custom scheduler on top of a call directly to epoll like the Python program does.
I suspect there wouldn’t be much of a difference. Once the wall of the OS or the hardware has been hit, there isn’t much more that can be done.
@Dennis: I’ve been over this again and again, and I’m getting very tired of dispelling this myth.
When you run Erlang on their lightweight thread, this is not an OS thread, the “context switching” is managed by their VM.
Guess how “context switching” happens in the Python script? That’s right, it is also managed by the VM, except it is interpreted, which means it should be even slower!
I simply do not understand why there are a few people who refuse to understand this and are very vocal about it. The context-switching overhead for Erlang should be, at the slowest, equivalent to that of the Python interpreter, which calls much more than 1 C function per instruction executed. There is no lock contention slowdown for Erlang and Haskell in this case because I disabled SMP.
And finally, I don’t know how many times I have to say this, but this benchmark is about servers.
If you have a server that uses 10 cores and is slower than a server that uses 1 core, which one are you going to pick?
This benchmark doesn’t show the limiting factor to be the system calls. If it was, Erlang would at least bottleneck at the same amount of requests/second with Kernel polling enabled since it also uses epoll.
I don’t get it. Ok, python’s FFI works. Your program (scheduler?) looks like any C or C++ code snippet out there, used for the last 20 years (with bsd select instead of poll) to explain to beginners how to use sockets, epoll, select and the like. Ok, it’s not C++, it’s python. However, what does that show? Ok, that Haskell is a really nice programming language. But wait, the Haskell program is a totally different beast, yielding the same “Nothing” and, when used wisely (see Don Stewart’s comments), faster than the python thing. And what?
I would suggest: write the Haskell piece again using the same FFIs and the same imperative do monad scheduler(what’s scheduled here? what the OS is telling, giving to us; in the same order the system is giving that to us. Wow! call it a soldier, not a scheduler, or even multiplexer, pah!) until the Haskell program looks much more similar to THE WORLDWIDE WELLKNOWN socket/accept/poll loop you are so proud of, generating tons of lazy calls, ignoring “onInput :: () -> ()”
@admin
I wanted to read this article to *learn* something.
However, after reading this article, and your rather antagonistic comments, you come across as a stone wall of arrogance rather than a font of knowledge. Did you take this ‘benchmark’ project on in order to learn something, or rather just to pick a fight to up your ‘street cred’?
After reading the comments I wanted to learn more about who you are, and to find out if your credentials can back up your arrogance…
Alas, you:
- Hide behind the ‘admin’ name. A rather unusual choice for the blog poster himself to be the ‘anonymous coward’.
- Provide no information about yourself or your experience in performing benchmarks.
- Your codexon.com domain lives behind a secret DNS registration that was only registered 8 months ago.
- You tend to write controversial articles about US Taxes, browser benchmarks, and how to game the system to gain reputation points on StackOverflow. ( http://bit.ly/ckhCps )
- Several of your articles claim to ‘debunk’ something.
Since you decline to provide any bona fides or reasons why I should assign you any credibility at all I will choose not to do so. You shall soon be forgotten as just another benchmark poser looking for a quick fix on digg.com, reddit, stackoverflow, etc.
Too bad.
I do not present credentials because I prefer that the content speaks for itself. The benchmarks and source code are always available for you to verify.
Would these facts be any less correct if I said I did not even graduate from high school? No. Would these facts be any more correct if I said I had a PhD from Harvard? No.
You are a shining example why private DNS registration is free from 1and1. And so what if this domain was only registered 8 months ago. Are you saying that the older your domain is, the more people should believe whatever you say?
If you choose to ignore a reproducible benchmark, that is your loss and reflects on your own laziness and arrogance. I do not feel the need to reveal my personal information just because you cannot be bothered to reproduce it. I can tell you right now that no one who has downloaded the source code has refuted the results.
And about my tone, it’s true it isn’t always courteous. But what do you expect when most of the people complaining such as yourself are far more argumentative, and saying things like “I didn’t do X” when it says “I did do X”. The only reason you didn’t call them out for being arrogant is because you are an Erlang fanboy yourself. You are hardly unbiased.
Egads. Anyone doing any research on Erlang online will quickly realize that it doesn’t do benchmarks very well. It sits within a battleship of a VM with oodles of overhead written in C to do very generalized message passing between lightweight distributed node computers on vast networks. It’s simply a cost effective way for Ericsson to run and update their networks. Go into the VM source code and take out the stuff you don’t need for a web server and then you might start seeing some speed. The only slower languages are like PHP and JavaScript. Haskell is made to be sublimely academic. It was made to outdo Lisp as a meta language for proving math theorems and PhD CS problems. No one is selling it as a web server solution. Finally, you don’t usually roll your own web servers. Apache wasn’t built in a day, so in general you need to see what language you want to use on top of the web server, not as the web server. Where Erlang is being sold is as a Unix-like “been there, done that” VM that has solved a lot of real world problems already so that you don’t have to make the same mistakes all over again. This is mainly for proprietary networks/protocols, not the Internet. What would be a more interesting comparison for you would be Python vs Lua, which really is being touted as fast.
Ok, this is a bit ad hominem, but if you take on the quest, it shall give us more… It should also be mentioned that some humour is required, sadly.
What have you done with Erlang in real-life? I don’t say that you evade this question, but it seems that you could be a former “Erlang blogbaboonfanboy” yourself.
Does Erlang suck at everything you do with it?
I’m a CS major myself and I could not write such a Python set. In my uni the “CS” is more about information system building and engineering instead of the tightest loop and the slimmest select. Surely those companies have used time and money to investigate the available languages/libraries before taking such a drastic step as adopting yet another language that hardly anyone uses professionally compared to Python, Java and C++. I’d say, that is the biggest NO-NO currently in Erlang. Those ass-stenching micro-benchmarks mean nothing at the management level. See Bjarne Däcker’s paper on what Ericsson Radio Ab did to Erlang. (Yeah, you know it already. It never died and its open sourcing gave you a cancer in the costume of a messiah. Cancer as in a crab.)
The thing I have learned from the blogosphere is that one should stay extremely critical. You are making a double fool of yourself here. You debunked yourself because you were a fool for believing the fanboy blogs, and you debunked the management of those mentioned companies. How foolish is that? Perhaps you are a polymath triple degree manager of a multi-million corporation using Erlang for something it shouldn’t, but since the boulder has started rolling you cannot reverse the outcome: your company is going to suffer, because of the ugly truth presented in this blog.
How about that. Looks like I wrote a letter!