If you're attempting a serious attack on profiling, you ought to spend some time looking at how to make profiling work in the face of long-running primitives. The profiles we get when using TimeProfiler et al. can be quite inaccurate if the code being profiled uses prims that take a long time; the process that interrupts to sample the subject code cannot actually interrupt inside such a long prim.
I forget how it was tackled, but VW has (or used to have) some code to try to improve the value of the results. It is possible that later changes have obviated the problem or changed it beyond obvious similarity.
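
To make the sampling problem concrete, here is a rough toy simulation in Python. It is not how TimeProfiler or VW actually work. It assumes a deliberately simplified model: the sampler ticks at a fixed interval, a tick that lands inside a primitive is held until the primitive returns, the held sample is charged to whatever runs next, and the sampler then restarts its delay from that point. The function name, the timeline and the segment names (parse, bigCopyPrim, report) are all made up for illustration.

```python
from collections import Counter


def simulate_sampling(timeline, interval_ms):
    """Toy model of a sampling profiler that cannot interrupt primitives.

    timeline: list of (name, duration_ms, is_primitive) run back to back.
    Returns (true_ms, samples): real time per name and samples per name.
    """
    true_ms = Counter()
    spans = []  # (start, end, name, is_primitive)
    t = 0
    for name, duration, is_prim in timeline:
        spans.append((t, t + duration, name, is_prim))
        true_ms[name] += duration
        t += duration
    total = t

    samples = Counter()
    tick = interval_ms
    i = 0
    while tick < total:
        # Advance to the segment that is running when the tick fires.
        while spans[i][1] <= tick:
            i += 1
        start, end, name, is_prim = spans[i]
        if is_prim:
            # Assumption: the sampler only gets to run once the primitive
            # returns, so the sample is charged to the following segment.
            if i + 1 < len(spans):
                samples[spans[i + 1][2]] += 1
            # Assumption: ticks missed during the primitive collapse into
            # that one sample; the delay restarts when the primitive ends.
            tick = end + interval_ms
        else:
            samples[name] += 1
            tick += interval_ms
    return true_ms, samples


# Hypothetical workload: the primitive takes 80 of the 100 ms.
timeline = [
    ("parse", 10, False),
    ("bigCopyPrim", 80, True),
    ("report", 10, False),
]
true_ms, samples = simulate_sampling(timeline, interval_ms=5)
for name, _, _ in timeline:
    print(name, "true ms:", true_ms[name], "samples:", samples[name])
```

In this toy run the primitive accounts for 80% of the real time yet collects no samples at all, while the code that happens to run right after it picks up the held sample. Real VMs will differ in the details, but that is the shape of the distortion.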
tim