[lttng-dev] CTF stress tests

Fri Nov 21 09:54:24 EST 2014

Great points!

I like the view: this is like the acid test, since that test is not run
at every build, but rather periodically. That is much more reasonable.

The reason for my initial email is that when I ran the stress test on
the parser, 147 tests failed. When I tweaked the timeouts and the stack
and heap size, it went down to 7. As it is a stress test, it had
repetitive data and was only testing narrow code paths.

The main point I got from the reply is that the program should be clear
about why it failed, not just freeze or disappear. We are probably all in
agreement on that. Efforts will be made on that front.

I still don't see this being run as a test on a per-patch basis, but
rather a periodic activity on the weekend, lest our builds take 4-5
hours instead of 20 min.

On 14-11-20 02:27 PM, Jérémie Galarneau wrote:
> CC-ing lttng-dev since this applies to Babeltrace.
>
> On Tue, Nov 18, 2014 at 9:50 AM, Matthew Khouzam
> <matthew.khouzam at ericsson.com> wrote:
>> Hi,
>> I was looking into the CTF stress tests.
>> They are good, but I don't know if we want them in our standard test
>> cases. They basically check the scalability of the computer it is being
>> run on in most cases and in all cases, the test of reading 12g of files
> I respectfully disagree.
>
> In most tests, there is definitely a smart way to deal with the trace.
> That is not to say that all implementations are expected to support
> every imaginable test case. However, they should strive to, at the
> very least, gracefully handle failures to read traces and document their
> limitations as the spec imposes very few.
>
> While I agree that some of these tests are far-fetched (you're not likely
> to see a CTF trace with a 32 768 level deep structure any time soon),
> traces with enough streams to exhaust the max fd count are not
> far-fetched at all. In fact, tracing any decently-sized cluster will bust
> that limit in no time.
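
A minimal sketch of one way a reader could stay under the fd limit, assuming
stream files can be reopened on demand. The class name StreamFilePool and the
max_open parameter are hypothetical; this is neither Babeltrace's nor our
reader's code, just an illustration of a least-recently-used pool of handles.

```python
from collections import OrderedDict


class StreamFilePool:
    """Keeps at most max_open stream files open (hypothetical helper).

    When the limit is reached, the least recently used handle is closed
    and reopened later if that stream is read again.
    """

    def __init__(self, max_open):
        if max_open < 1:
            raise ValueError("max_open must be at least 1")
        self.max_open = max_open
        self._open = OrderedDict()  # path -> file object, oldest first

    def read_at(self, path, offset, size):
        """Read size bytes at offset from the stream file at path."""
        f = self._open.pop(path, None)
        if f is None:
            if len(self._open) >= self.max_open:
                # Evict the least recently used handle to free one fd.
                _, oldest = self._open.popitem(last=False)
                oldest.close()
            f = open(path, "rb")
        self._open[path] = f  # re-insert as most recently used
        f.seek(offset)
        return f.read(size)

    def open_count(self):
        return len(self._open)

    def close(self):
        while self._open:
            _, f = self._open.popitem()
            f.close()
```

The trade-off is extra open/close calls when the access pattern hops between
more streams than max_open, which is slow but does not fail.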
>
> Handling a multi-megabyte sequence (~100 MB), something that
> Babeltrace can't do at the moment, may seem unreasonable at first.
> It quickly becomes very pertinent when people start talking of core
> dumping to a CTF trace.
>
>> is rather prohibitive. That being said, maybe having this as a weekly
>> build could be an interesting idea. Also, I don't think that busting
>> the heap size is a good indication of a test failure. I can always
>> throw more heap at the problem. :)
> Unfortunately, my motherboard's DIMM slots are already full. ;-)
>
>> Now for a breakdown of the tests:
>> ├── gen.sh
>> ├── metadata
>> │   └── pass
>> │       ├── large-metadata - interesting test, I think something like
>> this should be used to improve robustness
>> │       ├── long-identifier - this should be our approach here, I think
>> http://stackoverflow.com/questions/6007568/what-is-max-length-for-an-c-c-identifier-on-common-build-systems
> If you mean that an implementation should document its limitations,
> I agree.
+1
>
>> │       ├── many-callsites - works here until we oome but it's slow to
>> test and we don't have a real application to validate the data yet. So
>> it's good, but we may wish to put efforts elsewhere for now.
>> │       ├── many-stream-class - This test is looking at the max size of
>> an array...
> Not sure what you mean here although I agree that reading a trace with
> 16 million stream classes is not a "realistic" test.
>
> That's not the point of the CTF stress test suite. This test suite will
> expose errors that are not handled gracefully.
I am guessing this is the main issue, and should be addressed first.
Thank you for highlighting it.

> Implementations are
> expected to fail, but to do so in a controlled and documented way;
> not silently or by triggering the OOM-killer.
>
>> │       ├── many-typealias - interesting test, the parser will suffer.
>> │       └── many-typedef - ditto
>> └── stream
>>     └── pass
>>         ├── array-large - Testing the max size of ram and arrays.
> There is no need to load the entire array in RAM. I, personally, intend
> on implementing a slow fallback to disk when arrays exceed a given
> size.
>
> This will not work in live mode and that's completely okay. As long
> as it doesn't outright crash.
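
A rough sketch of the disk-fallback idea, assuming the array arrives in chunks.
Python's tempfile.SpooledTemporaryFile keeps data in memory until it exceeds
max_size and then moves it to a temporary file on disk. The function name
load_array and the 1 MiB threshold are hypothetical choices for illustration,
not what Babeltrace does or plans to do.

```python
import tempfile


def load_array(chunks, max_in_memory=1 << 20):
    """Buffer array bytes in RAM, spilling to disk past max_in_memory bytes.

    Returns a seekable file-like object positioned at the start.
    """
    buf = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    for chunk in chunks:
        buf.write(chunk)
    buf.seek(0)
    return buf
```

The caller reads elements with seek() and read() either way, so the code
consuming the array does not need to know whether it ended up on disk.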
>
> There is also a security aspect to this; an implementation shouldn't
> trust the relay daemon to send packets it can handle safely. This
> doesn't really apply to arrays as their length is statically known,
> but unchecked sequence lengths are definitely a security/stability
> concern.
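
To make the sequence-length concern concrete, here is a minimal sketch of
checking a declared length before allocating anything. The layout (a
little-endian uint32 length followed by uint8 elements), the names
read_u8_sequence and TraceFormatError, and the MAX_SEQUENCE_ELEMENTS limit are
all assumptions for illustration; a real CTF reader takes the layout from the
trace metadata.

```python
import struct


class TraceFormatError(Exception):
    """Raised when a packet declares something the reader refuses to handle."""


# Hypothetical reader limit; an implementation would document its own.
MAX_SEQUENCE_ELEMENTS = 1 << 20


def read_u8_sequence(packet, offset):
    """Read a uint32 length, then that many uint8 elements, from packet."""
    if offset < 0 or offset + 4 > len(packet):
        raise TraceFormatError(
            "packet truncated before sequence length at offset %d" % offset)
    (length,) = struct.unpack_from("<I", packet, offset)
    start = offset + 4
    remaining = len(packet) - start
    # Reject the declared length before using it to size anything.
    if length > MAX_SEQUENCE_ELEMENTS:
        raise TraceFormatError(
            "sequence of %d elements exceeds reader limit of %d"
            % (length, MAX_SEQUENCE_ELEMENTS))
    if length > remaining:
        raise TraceFormatError(
            "sequence declares %d elements but only %d bytes remain"
            % (length, remaining))
    return packet[start:start + length]
```

Both failure paths raise an error that says what was wrong, rather than
letting an allocation or an out-of-bounds read decide the outcome.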
>
>>         ├── many-events - This can be an interesting tool for profiling.
>> But it's not really a test for the reader...
> Not sure why. Tracing multiple UST applications over a long time is
> bound to stress this at some point.
>
> Just imagine a snapshot session running for a year, through multiple
> application updates. The number of UST tracepoints can quickly
> become humongous.
>
>>         ├── many-packets - This will be good to test distributed indexes
>>         ├── many-streams - Good, but even better is...
>>         ├── many-traces - interesting and a real problem
>>         ├── packet-large - this is actually easier than many-events
> Speak for yourself! Babeltrace mmaps entire packets! ;-)
> We have our own share of embarrassing limitations :-P
>
>>         ├── sequence-large - see array-large
>>         ├── string-large - see array-large
>>         ├── struct-many-fields - see array-large
>>         ├── struct-nest-n-deep - I like this one, it highlights one of
>> our optimisations
> Hmm, interesting! Care to explain the gist of it?
We have a struct flattener: if a struct only contains structs, the scope
is maintained with a separator and the values are brought up a level.
This means that your nest-n-deep tests always have a depth of 1 for
the reader. Please ping me for more details. :)
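
Roughly, the flattening idea looks like the sketch below. This is a simplified
illustration and not our actual code: declarations are modelled as nested
dicts, the "." separator and the name flatten_struct are made up for the
example, and it flattens every nested struct rather than only structs that
contain nothing but structs.

```python
def flatten_struct(decl, separator="."):
    """Flatten nested struct declarations into one level of joined names.

    decl maps field names to either a type name (str) or a nested dict
    standing for a nested struct.
    """
    flat = {}
    # Explicit stack of (prefix, field iterator) instead of recursion, so
    # the nesting depth is not limited by the call stack.
    stack = [("", iter(decl.items()))]
    while stack:
        prefix, fields = stack[-1]
        for name, child in fields:
            path = prefix + name
            if isinstance(child, dict):
                # Descend; the parent iterator resumes where it stopped.
                stack.append((path + separator, iter(child.items())))
                break
            flat[path] = child
        else:
            stack.pop()  # this struct is exhausted
    return flat
```

For example, {"a": {"b": {"c": "uint8"}}, "d": "uint32"} becomes
{"a.b.c": "uint8", "d": "uint32"}, so the reader only ever sees one level.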

Reading the struct still does awful things to the stack though.
Migrating to Antlr4 and Java 8 would probably help with that.
>>         └── variant-many-tags - see array-large
> Again, this is something I can easily see being produced by a
> dynamically-typed language. Imagine a VM implementation that
> would trace every function call for every possible argument type
> permutation.
>
>> Why no deep variants? Just curious.
> Good point, we should add that!
> We should also have nested sequences, arrays and mixes
> of variants, sequences and arrays.
>
> Our assumption is that the nesting-handling code is shared
> between various types; the nested-structures test should ideally
> trigger it.
>
> Perhaps we should not presume such implementation details...
> The test suite is still a work in progress; any weird test case
> you can think of is welcome!
A deeply nested something that has an int (1 bit) at every level.
A deeply nested something that has a sequence at every odd level and its
size at every even level.
>> so, tl;dr the tests are good at making a system suffer and qualifying
>> its boundaries. I think this is great for profiling, but for actual
>> testing, it should not be part of the per-patch system we have set up.
>>
>> Any thoughts?
> I think the real problem is assuming that a failing test is necessarily
> a deal-breaker.
>
> I see this test suite as CTF's ACID equivalent [1]. Most web browsers
> knowingly fail this test; that doesn't make it any less relevant.
>
> Jérémie
>
> [1] http://www.acidtests.org/
>