Comparing LINQ’s ToList and ToArray Extension Methods

Introduction

When working with an expression that results in a lazily evaluated sequence, it’s common to want to cache the results of that expression. This is often because there’s some large performance penalty associated with evaluating the items in the sequence (e.g. the sequence is a result of some database query). If the items must be iterated over more than once, holding the items in memory can give a speed boost, usually at the cost of some memory.

LINQ’s extension methods on IEnumerable<T> make this easy. Two methods are commonly used – ToList and ToArray. Many times, calls to one of these methods can easily be substituted for a call to the other, especially where the resulting collection is only being read. This article compares the two methods.

The Code

NOTE: Most LINQ code contains optimizations for collections which we already know the size for, i.e. ICollection<T>s. I’m going to ignore these, as when working with examples like the one mentioned in the introduction that’s rarely the case.

Both extension methods are defined in the System.Core assembly, in the System.Linq.Enumerable type. ToList is pretty simple, as List<T> already has a constructor which takes an IEnumerable<T>. The implementation simply calls that constructor. Under the hood, List<T> uses an array to hold its items. If you look through the decompiled code, you’ll see the first array starts out with a capacity of 4. Whenever the underlying array is not big enough, a new array is created with double the length. Therefore, if we call a constructor with a sequence which will evaluate to 15 items, three arrays will be created under the hood – a T[4], a T[8], and a T[16]. The 4 and 8 length arrays will be dereferenced and GC’d. There shouldn’t be any surprises there if you know anything about the internals of List<T>.

ToArray, it turns out, is a little different. ToArray uses the internal Buffer<T> class. Buffer<T> basically uses the same array expansion policy that List<T> uses (so much for DRY). However, there’s one interesting difference. A List<T> hides its underlying array, meaning that it can present its own item count. This count is not necessarily the same as the array length. It’s perfectly valid to have a List<T> containing 11 items with an underlying T[16] (the underlying array’s size is exposed via the Capacity property). Because ToArray returns an array, the array must be of the correct length. It’d be a little weird to call ToArray on a sequence yielding 5 items and get back a T[8] containing three default(T)s. Therefore, Buffer<T> uses an extra step to create an array of the correct size, and return that one. If we call ToArray on a sequence that yields 15 items, four arrays are created – T[4], T[8], T[16] and T[15]. The first three are dereferenced and will be GC’d. Of course, if the sequence yields 16 items (or any length which is a power of 2), no trimming step is needed.

Practical

So we’ve seen that ToArray should perform worse than ToList for sequences yielding a number of items that is not a power of 2. Let’s check this theory with a little F#. Note here that each test is run 100 times.

  1. open System
  2. open System.Collections.Generic
  3. open System.Diagnostics
  4. open System.Linq
  5.  
  6. let seqLength = 1
  7.  
  8. let items = seq { for i = 1 to seqLength do yield i }
  9.  
  10. let runTest (name, f) =
  11.     let sw = Stopwatch()
  12.     sw.Start()
  13.     for i = 1 to 100 do f()
  14.     sw.Stop()
  15.     sw.ElapsedMilliseconds
  16.  
  17. let tests = [| "ToList", (fun () -> items.ToList() |> ignore)
  18.                "ToArray", (fun () -> items.ToArray() |> ignore) |]
  19.  
  20. tests
  21. |> Seq.map (fun (name, exec) -> name, runTest(name, exec))
  22. |> Seq.iter (fun (name, timeMs) -> printfn "%8s  %Oms" name timeMs)

Here we have a seqeunce that will yield seqLength items. Running this for sequences with lengths of 1, 10, 100, 1,000, and 10,000 don’t appear to reveal a great deal of differences on my laptop, however, once we get to 100,000 and 1,000,000 items ToList starts to beat ToArray. The results for 100,000 are around the following:

  1.   ToList  2064ms
  2.  ToArray  2288ms

That’s a difference of 2.24ms per operation. For 1,000,000:

  1.   ToList  18228ms
  2.  ToArray  19024ms

With a larger sequence, the difference per operation is around 7.96ms. So we see that in terms of processing time, ToList begins to outperform ToArray when sequence sizes get to 100,000 items.

Let’s now try with a sequence of 1,048,576 items, which is a power of 2.

  1.   ToList  18338ms
  2.  ToArray  18060ms

That’s interesting. Now ToArray is faster than ToList. This can be put down to the fact that ToArray now doesn’t need a trimming step, and the fact that Lists have more overhead than an array.

Memory-wise, ToArray can be expected to use more memory while it runs because of the trim step, but this will change if no trim is needed. However, the result of ToArray always consumes less memory than the result of ToList.

Conclusion

So which is best? I’d previously assumed that ToArray would always be faster, but with very large sequences this turns out to be false. ToList is perhaps a slightly better choice for huge sequences. Of course, all this information needs to be put into persepective. The difference between the two is almost imperceptible – and that’s for a sequence generated in memory. Database queries resulting in similar result sizes I expect would take much longer to yield, dwarfing the difference between ToArray and ToList even further.

For me, these results simply mean that I will no longer view ToArray as a lightweight option compared to ToList, and perhaps focus more on the features I want from the resulting collection rather than the performance considerations of obtaining that collection.

Share and Enjoy:
  • Print
  • Digg
  • StumbleUpon
  • del.icio.us
  • Facebook
  • Yahoo! Buzz
  • Twitter
  • Google Bookmarks
  • email
  • LinkedIn
  • Technorati