Profiling Proto

Profiling is a useful technique for analyzing an application to determine where it spends most of its processing time.  As new functionality is added to the Proto suite of programs (compilers and simulators), profiling data can point a programmer to inefficiencies in the system, and perhaps to the right spot to optimize the code.  Optimizing without the detailed performance knowledge gained from profiling can lead to wasted effort on code that won't provide any noticeable performance benefit.

The current Proto paleo and neo compilers, and the MIT Proto simulator and virtual machine, are written in C and C++ and built with the GNU toolchain.  The GNU profiler (gprof) can therefore be used to collect profiling data.

One solid resource for understanding and working with gprof is available at http://www.cs.utah.edu/dept/old/texinfo/as/gprof_toc.html.

To compile Proto with profiling enabled, you need to add the -pg flag to the compiler flags.  This is done through the configure script.  Add the following to your configure command:

./configure [other config flags go here] CFLAGS="-g -pg -O3" CXXFLAGS="-g -pg -O3"

The -O3 is optional; use whatever optimization level you wish.  Lower optimization levels are useful if you're running the debugger, while higher levels are better for realistic performance measurements.
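
After re-running configure with those flags, do a full rebuild so that every object file is compiled with -pg.  A minimal sketch of the cycle, assuming the usual autotools targets and your normal install step:

make clean && make && make install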

To run Proto with profiling enabled, you don't have to do anything special! Simply run Proto as you normally would.  It will run a little slower due to the profiling system, but no new flags or options are required.

proto "(gradient2)"

In addition to any normal program output, the profiling-enabled application will produce a file named "gmon.out" in the current working directory when it exits.  You then run the gprof profiler, which reads that file, processes it, and prints a summary report.  gprof's main argument is the executable (here /usr/local/bin/proto); by default it looks for gmon.out in the current directory, and you can redirect its report to a file:

gprof /usr/local/bin/proto > protoprof.txt
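
A few standard gprof options can make the report easier to work with (these are regular GNU gprof flags; check your local man page if your version differs): -b suppresses the explanatory blurb that follows each section, -p prints only the flat profile, and -q prints only the call graph.  For example:

gprof -b /usr/local/bin/proto > protoprof.txt

gprof -b -q /usr/local/bin/proto > protograph.txt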

Then examine that output file using your favorite text editor (which should be Emacs!  :)
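
Note that each profiled run overwrites gmon.out in the current directory.  If you want to keep results from several runs (say, with and without graphics), copy the file aside and pass its new name to gprof explicitly; the filenames here are just examples:

cp gmon.out gmon-gradient2.out

gprof /usr/local/bin/proto gmon-gradient2.out > protoprof-gradient2.txt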

The profiling output contains a text description of what it is showing you; however, that description comes (helpfully) AFTER each section it describes, not before.  The top of the file is the "Flat profile", which should look something like this:

 

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  us/call  us/call  name
 28.57      0.02     0.02    71900     0.28     0.28  post_data_to2(char*, Data, int)
 14.29      0.03     0.01    77200     0.13     0.13  DebugDevice::visualize()
 14.29      0.04     0.01     4466     2.24     2.24  CompilationElement::inherit_attributes(CompilationElement*)
 14.29      0.05     0.01                             _fu277___ZNSs4_Rep20_S_empty_rep_storageE
 14.29      0.06     0.01                             _fu48__palette
 14.29      0.07     0.01                             glEnd
  0.00      0.07     0.00   847330     0.00     0.00  Data::reset()
  0.00      0.07     0.00   188249     0.00     0.00  std::vector<Data, std::allocator<Data> >::push_back(Data const&)
  0.00      0.07     0.00   154400     0.00     0.00  SimpleBody::display_radius()
  0.00      0.07     0.00    79200     0.00     0.00  SimpleBody::position()
  0.00      0.07     0.00    77200     0.00     0.00  SimpleBody::visualize()
  0.00      0.07     0.00    77200     0.00     0.00  DeviceLayer::visualize()
  0.00      0.07     0.00    77200     0.00     0.00  UnitDiscDevice::visualize()
  0.00      0.07     0.00    77200     0.00     0.00  Device::text_scale()
  0.00      0.07     0.00    77200     0.00     0.13  Device::visualize()
  ...

The description of what the flat profile is showing you is at the end of the flat profile.  Put simply, it shows which functions are taking up the most CPU time.  The leftmost column is the percentage of time the program spends in the function named in the rightmost column.  The list is sorted by the first column, so the first entry is the function that takes the most processing time.  You're usually looking for large outliers: a few functions at the top that take a disproportionate amount of time.  Those are the candidates for optimization and refactoring.

In the example above, we're seeing mostly graphics methods; most of the time is spent drawing things on the screen.  There are some functions in the middle (Data::reset, for example) that don't look like graphics methods.  To clear things up, let's run Proto in headless mode so there's no graphics processing to do.

./proto "(gradient2)" -headless -stop-after 200

We have to provide a stopping time when we go headless; that's the -stop-after argument.
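
When the headless run finishes, generate the report the same way as before (the output filename is arbitrary):

gprof /usr/local/bin/proto > protoprof-headless.txt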

Another profiling cleanup is to eliminate the compiler.  The standard proto program reads in your Proto program, compiles it, emits the bytecode, and then runs it through the simulator, and the profiler captures that entire process.  Compiling is usually fast enough that it won't dominate the time, but you can get rid of this step via the -opcodes option, which lets proto take a file of bytecode, effectively skipping the compilation and emission steps.  You can get the bytecode for your program with the --instructions argument.

So, run proto with --instructions, cut-and-paste the resulting bytecodes into a file, then run proto with -opcodes [bytecode filename].
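
Roughly, that cycle looks like this; the .ops filename is arbitrary, and depending on what else proto prints you may need to trim the file by hand so it contains only the bytecodes:

proto "(gradient2)" --instructions

Copy the printed bytecodes into a file (say, gradient2.ops), then:

proto -opcodes gradient2.ops -headless -stop-after 200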

I usually skip that step, since compilation and emission don't usually dominate the simulator's processing time.

Here's an example of profiling output for the same gradient program in headless mode.

 

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 22.58      0.07     0.07 26069428     0.00     0.00  Data::reset()
 12.90      0.11     0.04  5786750     0.00     0.00  std::vector<Data, std::allocator<Data> >::push_back(Data const&)
  9.68      0.14     0.03                             BasicStack<Data>::pop() [clone .clone.67]
  6.45      0.16     0.02   473272     0.00     0.00  Instructions::NBR_RANGE(Machine&)
  6.45      0.18     0.02                             (anonymous namespace)::compare(Data const&, Data const&)
  3.23      0.19     0.01   493274     0.00     0.00  void Instructions::REF_N<0>(Machine&)
  3.23      0.20     0.01   473372     0.00     0.00  Instructions::RET(Machine&)
  3.23      0.21     0.01   276640     0.00     0.00  Instructions::EQ(Machine&)
  3.23      0.22     0.01   196632     0.00     0.00  HoodInstructions::fold_hood_plus_step_fuse(Machine&)
  3.23      0.23     0.01   196632     0.00     0.00  HoodInstructions::fold_hood_plus_step_filter(Machine&)
  3.23      0.24     0.01    40004     0.00     0.00  Instructions::FOLD_HOOD_PLUS(Machine&)
  3.23      0.25     0.01    20002     0.00     0.00  Instructions::INIT_FEEDBACK(Machine&)
  3.23      0.26     0.01    20002     0.00     0.00  Machine::run_callback(Machine&)
  3.23      0.27     0.01    19961     0.00     0.00  UnitDiscRadio::radio_send_export(unsigned char, Array<Data> const&)
  ...

We can see that a large portion of time is spent on operations on the Data class, especially the reset method.  We can look at the second section of the profiling output, the call graph, to find out where these methods are being called from.  As before, the description of what the call graph shows is presented after the call graph itself; the text right before the call graph is the description of the preceding flat profile.

 

Call graph section

...
                0.00    0.00 1613160/26069428     HoodInstructions::fold_hood_plus_step_fuse(Machine&) [9]
                0.01    0.00 2369748/26069428     (anonymous namespace)::compare(Machine&) [25]
                0.01    0.00 3452918/26069428     Instructions::ADD(Machine&) [12]
                0.01    0.00 4620084/26069428     BasicStack<Data>::pop() [clone .clone.67] [5]
                0.02    0.00 5786650/26069428     std::vector<Data, std::allocator<Data> >::push_back(Data const&) [4]
[2]     22.6    0.07    0.00 26069428         Data::reset() [2]
-----------------------------------------------

The line with a bracketed number in the far left column ([2] here) is the line for the function this section of the call graph describes.  Lines above it are the "parents": the functions that call our function.  Lines below it are the "children": the functions that our function calls.  Here we have about 26 million calls to Data::reset, most coming from the last few caller lines: pop, push_back, ADD, compare, and fold_hood_plus_step_fuse.  You can then look further at those functions' own sections of the call graph.
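
The full report can get long; a quick way to jump to a particular function's call graph entry is simply to search the report for its name (or its bracketed index number):

grep -n "Data::reset" protoprof.txt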

Data::reset is a garbage collection method in the new VM.  The profiling data shows us quickly that Data::reset is a great place to spend optimization effort: it appears to be called more often than it probably needs to be, and anything we could do to speed it up would have a noticeable effect on the run time of the simulator.
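
If you want to see where inside a hot function like Data::reset the samples land, gprof also has a line-by-line mode: the -l flag, which uses the -g debug information we already compiled in.  Its output is much noisier, so treat it as a rough guide rather than gospel:

gprof -l /usr/local/bin/proto > proto-lines.txt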

 


Comments

jakebeal

Why is "pop" so costly?

Something odd I noticed in the profiling data:

  9.68      0.14     0.03                             BasicStack<Data>::pop() [clone .clone.67]

What do you think the reason might be for pop() being so expensive?

 

bbenyo

pop() just calls Data::reset()

pop() just calls Data::reset() on the elements being popped, which invokes the garbage collector cleanup code.  I'm guessing reset is getting inlined and the profiler is attributing its time to pop.  You could experiment with recompiling with inlining off and see what that does.
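
For example, something along these lines should do it; -fno-inline is a standard GCC flag, and the rest is just the configure command from the top of this page:

./configure CFLAGS="-g -pg -O3 -fno-inline" CXXFLAGS="-g -pg -O3 -fno-inline"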

jakebeal

Re: pop() just calls Data::reset()

Ah, that makes sense... I suppose that, more than anything else, the VM GC needs to be improved.

I hear that Maurice is planning to work on a VM upgrade, so you should point him to your profiling results if you haven't already.