Slow disk read speed on ESX4
[UPDATE 11/11/10] I really should have posted another update earlier, but did not realize this article was still getting hits. Please be aware that the issue is fully resolved in ESX 4.1, and the patch is no longer needed.
[UPDATE 01/05/10] Good news! The patch ESX400-200912401-BG addressing the disk read speed issue discussed below is now available from VMware; please refer to VMware support KB article 1016291 for more information and download. My quick testing showed about a 2x improvement.
Last week, Veeam released version 3.1 of the Veeam Backup product with support for VMware vSphere. We were the first to provide vSphere support in a VMware backup product, so together with early vSphere adopters, we got to be the first to see what ESX 4.0 is all about.
As soon as the first customers upgraded to vSphere after having Veeam Backup 3.1 installed, we immediately started receiving complaints about network backup performance after upgrading ESX hosts to ESX 4.0. Customers using the service console agent backup mode observed a 2-3x backup speed reduction! So I had to get my hands dirty and find out what had changed in ESX 4.0 to cause such a great performance decrease.
A couple of words about my testing setup:
- ESX host with local SAS storage (a couple of modern hard drives in a RAID0 configuration for best performance)
- Debian Linux server with a modern hard drive formatted ext2 for best performance (acting as source and target for the file copy testing)
My simple test plan was to measure file copy speed to/from an ESX 3.5 host, then install ESX 4.0 on the same host and repeat the test. I used Veeam FastSCP 3.0.1 in the service console agent mode for all testing. FastSCP shares its file transfer engine with Veeam Backup, so downloading a file from ESX simulates backup activity, whereas uploading a file simulates restore. Note that because the Veeam service console backup engine has a "direct-to-target" architecture, all network traffic goes straight from the ESX host to the target Linux server, bypassing the Veeam Backup console. Thus, the test environment is completely isolated to my two test servers, both located on the same network switch.
I created 5 test files of 4GB each with randomly generated content, to prevent the Veeam Backup engine from doing empty block removal and network traffic compression, which would skew the results. I used multiple files to ensure that the file system cache on the Linux server did not affect the results.
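For reference, here is a minimal sketch of how such test files can be generated from a shell. The file names are illustrative, and the size is reduced from the 4GB used in the actual test so the sketch runs quickly:

```shell
#!/bin/sh
# Generate test files filled with random data, so that compression
# and empty-block detection cannot shrink the transfer and skew results.
# The actual test used COUNT=5 and SIZE_MB=4096 (4 GB per file);
# SIZE_MB is reduced here purely for illustration.
COUNT=5
SIZE_MB=16

for i in $(seq 1 $COUNT); do
    # /dev/urandom produces incompressible data with no empty blocks
    dd if=/dev/urandom of="testfile$i.bin" bs=1M count=$SIZE_MB 2>/dev/null
done

ls -l testfile*.bin
```

Using multiple files whose combined size exceeds the server's RAM also keeps the file system cache from serving repeat reads, which is the same reason multiple 4GB files were used in the test above.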
So, to make a long story short, here are the results of the FastSCP file copy test between the Linux server and the ESX host.
As you can see, disk I/O speed is as expected for both upload and download in the case of the ESX 3.5 host. Transfers capped around the point where my Linux server's disk read/write performance and the network connection speed start coming into play. However, with ESX 4.0, download performance was awful. Interestingly, there was no change in upload speed – only downloads seemed to be affected by the throttling. Just to make sure it was not a FastSCP fault, I also performed all the same tests with the VMware Infrastructure Client datastore browser, and got exactly the same results.
To completely remove the network stack and FastSCP from the equation, I also verified the maximum possible read speed from the ESX service console by running the time cat > /dev/null command against the test files on both ESX hosts' service consoles, and here are the results.
As you can see, with ESX 4.0 the service console storage read speed looks to be throttled at 25MB/s. At the same time, when performing the exact same test in a Linux VM running on my ESX 4.0 host (with the flat VMDK file located on the same storage), the read speed skyrockets to a whopping 125MB/s, indicating that there are no issues with the actual storage performance, or with the ESX host's ability to access it effectively.
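A minimal sketch of this kind of raw sequential read test follows. The file name and size are illustrative (the actual test read the 4GB files described above), and note the caveat in the comments about the page cache:

```shell
#!/bin/sh
# Measure raw sequential read speed of local storage, bypassing the
# network stack entirely: read a file and throw the data away.
# File name and size are illustrative only.
SIZE_MB=64
dd if=/dev/zero of=readtest.bin bs=1M count=$SIZE_MB 2>/dev/null

START=$(date +%s)
cat readtest.bin > /dev/null
END=$(date +%s)

ELAPSED=$((END - START))
[ "$ELAPSED" -eq 0 ] && ELAPSED=1   # avoid divide-by-zero on very fast reads
echo "read speed: approx $((SIZE_MB / ELAPSED)) MB/s"

# Caveat: repeated runs will be served from the OS page cache rather than
# disk. For meaningful numbers, read files larger than RAM, or rotate
# through multiple large files as done in the test above.
```

A one-second timer resolution is crude; for anything beyond a quick sanity check, larger files (so the read takes tens of seconds) give far more trustworthy numbers.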
I wonder if this is a bug, or an undocumented "feature" of ESX 4.0 designed to force people to stop using service console agent backups altogether by intentionally making performance unacceptable. I can definitely see how this will affect those VMware backup vendors who have no option for backing up VMs other than a service console agent.
In any case, I am hoping this will be addressed by VMware sooner rather than later: I can see this issue preventing many customers from upgrading to vSphere, and that is definitely not what VMware wants. One of our competitors claims that as many as 50% of their customers are using service console backup, and I tend to agree with that estimate, although based on the feedback I have been receiving on the Veeam Community Forums it is more like 30-40% in Veeam's case (the rest are using VCB). On a side note, VCB in SAN mode is obviously not affected by this cap, because it provides direct storage access – so as long as you are using VCB, speed should remain fine after upgrading to ESX 4.
Let me know in the comments what you think about this.
[UPDATE 06/11/09] Got a response from John Troyer of VMware:
We have not intentionally crippled service console agents. We’re investigating…
[UPDATE 07/11/09] Been working together with VMware on testing the issue and possible resolutions; will publish more information as soon as VMware makes some sort of official statement available.
[UPDATE 07/28/09] VMware published support KB article 1012159 acknowledging the issue at http://kb.vmware.com/kb/1012159
[UPDATE 01/05/10] VMware made the patch ESX400-200912401-BG available that improves service console read speed about 2 times, refer to support KB article 1016291 at http://kb.vmware.com/kb/1016291 for more information.