Slow disk read speed on ESX4

2009 June 11
by Anton Gostev

[UPDATE 01/05/10] Good news! Patch ESX400-200912401-BG, which improves the disk read speed issue discussed below, is now available from VMware. Please refer to VMware support KB article 1016291 for more information and the download. My quick testing showed about a twofold improvement.

Last week, Veeam released version 3.1 of the Veeam Backup product with support for VMware vSphere. We were the first to provide vSphere support in a VMware backup product, so together with early vSphere adopters, we got to be the first to see what ESX 4.0 is all about.

As soon as the first customers with Veeam Backup 3.1 installed upgraded to vSphere, we started receiving complaints about network backup performance after upgrading ESX hosts to ESX 4.0. Customers using service console agent backup mode observed a 2-3 times reduction in backup speed! So I had to get my hands dirty and find out what changed in ESX 4.0 to cause such a large performance decrease.

A couple of words about my testing setup:

  • ESX host with local SAS storage (a couple of modern hard drives in a RAID0 configuration for best performance)
  • Debian Linux server with a modern hard drive formatted as ext2 for best performance (acting as source and target for file copy testing)

My simple test plan was to measure file copy speed to and from an ESX 3.5 host, then install ESX 4.0 on the same host and repeat the same test. I used Veeam FastSCP 3.0.1 in service console agent mode for all testing. FastSCP shares its file transfer engine with Veeam Backup, so downloading a file from ESX simulates backup activity, whereas uploading a file simulates restore. Note that because the Veeam service console backup engine has a “direct-to-target” architecture, all network traffic goes straight from the ESX host to the target Linux server, bypassing the Veeam Backup console. Thus, the test environment is completely isolated to my test servers, both connected to the same network switch.

I created 5 test files of 4 GB each with randomly generated content, to prevent the Veeam Backup engine from performing empty block removal and network traffic compression, which would skew the results. I used multiple files to ensure that the file system cache on the Linux server does not affect the test results.
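For readers who want to prepare similar test files, here is a minimal Python sketch. It is not the tool I used for the tests above; the function name, chunk size and file names are assumptions made for illustration. It fills a file with random bytes, which neither compression nor empty block removal can shrink.

```python
import os


def write_random_file(path, size_bytes, chunk_bytes=1024 * 1024):
    """Fill `path` with `size_bytes` of random data, written in chunks.

    Random bytes are effectively incompressible and contain no long runs
    of zeros, so compression and empty block removal cannot reduce them.
    Returns the number of bytes written.
    """
    written = 0
    with open(path, "wb") as f:
        while written < size_bytes:
            n = min(chunk_bytes, size_bytes - written)
            f.write(os.urandom(n))
            written += n
    return written


# Hypothetical usage: five 4 GB files (the names are made up for this sketch).
# for i in range(5):
#     write_random_file("testfile%d.bin" % i, 4 * 1024 ** 3)
```

Writing in 1 MB chunks keeps memory usage flat regardless of the file size.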

So, to make a long story short, here are the results of the FastSCP file copy test between the Linux server and the ESX host.

[Image “post1”: results of the FastSCP file copy test]

As you can see, disk I/O speed is as expected for both upload and download in the case of the ESX 3.5 host. Transfer speed was capped around the point where my Linux server’s disk read/write performance and the network connection speed start coming into play. However, with ESX 4.0, download performance was awful. Interestingly, there were no changes in the upload speed – only download seems to be affected by throttling. Just to make sure it is not FastSCP’s fault, I also performed all the same tests with the VMware Infrastructure Client datastore browser, and got exactly the same results.

To completely remove the network stack and FastSCP from the equation, I also verified the maximum possible read speed for the ESX service console by running the time cat > /dev/null command on the test files in the service console under both ESX versions, and here are the results.
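The same kind of measurement can be sketched in Python: read the file sequentially, throw the data away, and divide the bytes read by the elapsed time. This is only an illustration and not what I ran in the service console. The function name and chunk size are assumptions, and it assumes a Python 3 interpreter is available on the machine being tested.

```python
import time


def read_throughput_mb_s(path, chunk_bytes=1024 * 1024):
    """Read `path` sequentially, discard the data, and time it.

    Returns a tuple (bytes_read, megabytes_per_second).
    """
    total = 0
    start = time.perf_counter()
    # buffering=0 avoids Python's own buffering layer; the operating
    # system's file cache can still serve repeated reads of the same file.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    mb_s = (total / (1024 * 1024)) / elapsed if elapsed > 0 else float("inf")
    return total, mb_s
```

As with the cat test, a file that was read recently may come from cache, so it makes sense to rotate through several large files rather than re-reading one.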

[Image “post2”: service console read speed results]

As you can see, with ESX 4.0, service console storage read speed appears to be throttled at 25 MB/s. At the same time, when performing the exact same test in a Linux VM running on my ESX 4.0 host (its flat VMDK file is located on the same storage), the read speed skyrockets to a whopping 125 MB/s, indicating that there are no issues with actual storage performance, or with the ability of the ESX host to access it effectively.

I wonder if this is a bug, or an undocumented “feature” of ESX 4.0 designed to force people to stop using service console agent backups altogether by intentionally making performance unacceptable? I can definitely see how this will affect those VMware backup vendors who have no option for backing up VMs other than a service console agent.

In any case, I am hoping this will be addressed sooner rather than later by VMware: I can see how this issue will prevent many customers from upgrading to vSphere, and this is definitely not what VMware wants. One of our competitors claims that as many as 50% of their customers are using service console backup, and I tend to agree with these estimates, although based on the feedback I have been receiving on the Veeam Community Forums it is more like 30-40% in the case of Veeam (the rest are using VCB). On a side note, VCB in SAN mode is obviously not affected by this cap, because it provides direct storage access – so as long as you are using VCB, speed should remain fine after the upgrade to ESX 4.

Let me know in the comments what you think about this.

[UPDATE 06/11/09] Got a response from John Troyer with VMware:
We have not intentionally crippled service console agents. We’re investigating…

[UPDATE 07/11/09] I have been working together with VMware on testing the issue and possible resolutions, and will publish more information as soon as VMware makes some sort of official statement available.

[UPDATE 07/28/09] VMware published support KB article 1012159 that acknowledges the issue at http://kb.vmware.com/kb/1012159

[UPDATE 01/05/10] VMware made the patch ESX400-200912401-BG available that improves service console read speed about 2 times, refer to support KB article 1016291 at http://kb.vmware.com/kb/1016291 for more information.

19 Responses
  1. 2009 June 11
    VMdoug

    Thanks for the post, let’s hope this is a bug and not a “feature”

  2. 2009 June 25
    Lester

    re VMdoug’s comment. I’d be interested to see what issues the “Major” backup vendors are having in this area..

  3. 2009 July 3
    Paul

    Any news back from VMware at all? This is driving me so mad that I’m migrating machines back to ESXi 3.5 from vSphere 4

  4. 2009 July 3

    Hi Paul, the support case has been going up the chain for the past 2 weeks, but responses are very slow. So far we’ve been told that we are not the only ones reporting the degraded performance, and that this is currently being investigated by the VMware engineering department, but there are no ETAs. We requested an official statement and were introduced to the responsible person, but he has not been responding for the past few days either.
    Frankly, I am not very happy with the quality of VMware support, but this is a separate story.

  5. 2009 July 3
    Paul

    Thanks for the update. It’s lucky that the company that I work for decided to stay on 3.5. I would hate to have to revert about 20 racks of HP BL series blade servers if we didn’t have the SAN snapshotting. As I don’t have direct contact with VMware via the company, I have no means to scream and shout… yet. Love the site, keep up the good work :)

  6. 2009 July 5
    Louw Pretorius

    Remember that in vSphere the COS is another VM….

    So I think there might have been a configuration change to “discourage” COS-based backup software in favor of VCB or vDR, but all will be revealed shortly, hopefully by Engineering and not Marketing….

  7. 2009 July 9
    Trevor

    I can confirm this on several systems built to purpose in my lab. The difference between ESXi 3.5 (or VMware 2 on CentOS 5.3) in terms of transferring VMs up to the host is staggering. Currently the solution seems to be the iSCSI trick: get some target software (or a cheap NAS) and load the VMs onto it, attach it as iSCSI storage to the host, and transfer to/from that datastore. Significantly faster than simply using “upload” in vSphere, or attempting to use an NFS store as a “bounce point” for transferring VMs around.

  8. 2009 July 27
    NiTRo

    Hi, very interesting reports. Do you get the same bad results between ESXi 3.5 to 4.0 ?

  9. 2009 July 28

    Hi NiTRo, I have tested both ESXi 3.5 and ESXi 4.0, installing them on the very same host, and did not observe any differences – they were both equally slow (around 20 MB/s download, 8-10 MB/s upload). This is using agentless mode in FastSCP (based on the NFC API), which is the only possible mode for ESXi because the service console is simply not available.

  10. 2009 August 11
    Paul

    Actually, I don’t believe Veeam was the first to support ESX 4.0/vSphere… We have been using vRangerPro 4.0 for some time before the new version of Veeam was released.

  11. 2009 August 11
    Paul

    NM… sorry, my bad… got the dates messed up.

  12. 2009 September 2
    thorsten

    this issue absolutely sucks big. any news about this? when will a fix be available ?

  13. 2009 October 11

    VMware is working on the fix… recently I played with a test version of the fix and it does improve the performance. I am not sure if this stuff is covered by NDA, so unfortunately I cannot provide more information at this time (like performance numbers with the fix installed, its general availability and so on).

  14. 2009 October 15
    Ahmed

    I’m using VCB + Backup Exec 12.5
    You can safely say that I lost 50% of my speed when backing up with the new version of VMware ESX 4.0.

    :(
    Bummer.
    Want to downgrade to 3.5 :(

  15. 2009 November 9
    Jason

    I am running v4 and Backup Exec 12.5 also. All resources are local disk direct attached RAID1 volumes.

    The guest was converted from a physical machine onto a new server. Our D2D2T backup used to take 10hrs when physical. (5hrs for D2D and 5hrs for D2T) Now the D2D backup takes 28hrs and the tape backup around 11.5hrs. (yes the tape is faster). The host doesn’t have that much load on it either and everything else is working with acceptable speed.

    Doing some more perf testing and I am waiting for a new RAID controller to arrive in the hope that this improves my problem but I didn’t think my local disk system was too bad.

  16. 2009 November 9

    To be honest, given the speed we are seeing with Veeam Backup 4.0 on ESX4 (due to leveraging the vStorage API and changed block tracking), this issue became much less critical for our customers. Even though the actual data copy is slow, change tracking still makes incremental backups extremely fast (VM processing speed is hundreds of MB/s). So really only full backups are affected – which only happen once in Veeam anyway, thanks to synthetic backup.

  17. 2010 January 11

    Updated the article with resolution!

  18. 2010 February 3
    james

    I wasn’t able to download the patch from the vmware site. Anybody else have any luck with this site? http://support.vmware.com/selfsupport/download/

  19. 2010 February 3

    Looks like the whole http://support.vmware.com is down today.