[lttng-dev] Trigger snapshots on a watchdog

Fri Sep 13 05:51:02 EDT 2024

Hi Damien,

I've added a very brief feature request issue here[1], referring to 
this discussion. If you would like to elaborate or add other details, 
that would be most excellent.

thanks,

kienan

[1]: https://bugs.lttng.org/issues/1416
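
For readers of the archive, the "exec `lttng snapshot record`" option from
the quoted message below can be sketched roughly as follows. This is only
an illustration, not code from lttng-tools: the names Watchdog, pat, check,
snapshot_command and record_snapshot are hypothetical, and it assumes that
the `lttng` CLI is on the PATH and that a snapshot-mode session already
exists.

```python
import subprocess
import time


def snapshot_command(session_name):
    """Build the argv for `lttng snapshot record` on one session."""
    return ["lttng", "snapshot", "record", "--session=" + session_name]


def record_snapshot(session_name):
    """Run the lttng CLI; raises CalledProcessError on a non-zero exit."""
    subprocess.run(snapshot_command(session_name), check=True)


class Watchdog:
    """Hypothetical watchdog: calls on_expire when pat() has not been
    called for timeout_s seconds."""

    def __init__(self, timeout_s, on_expire, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.on_expire = on_expire
        self.clock = clock          # injectable so the logic can be tested
        self.last_pat = clock()
        self.fired = False

    def pat(self):
        """Record a heartbeat and re-arm the watchdog."""
        self.last_pat = self.clock()
        self.fired = False

    def check(self):
        """Fire on_expire at most once per missed deadline.

        Returns True only when this call fired the callback.
        """
        if self.fired or self.clock() - self.last_pat < self.timeout_s:
            return False
        self.fired = True
        self.on_expire()
        return True
```

A monitor process could create
`Watchdog(0.5, lambda: record_snapshot("my-session"))` (the session name is
a placeholder), call pat() whenever the monitored application reports in,
and call check() from a periodic timer. How the heartbeat reaches the
monitor (socket, pipe, shared memory) is left open here.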

On 2024-09-12 12:14, Damien Berget wrote:
> Thanks for the quick response Kienan,
> Your proposal is exactly how we were thinking the monitor application 
> could work, so we'll go with that for now.
> Reacting to the absence of an event (a watchdog) would really be a good 
> complement to the existing trigger types.
> It's a really useful feature for a flight recorder in embedded medium 
> real-time applications. Is the team open to feature requests?
> Cheers
> Damien
>
> On Thu, Sep 12, 2024 at 12:57 AM Kienan Stewart 
> <kstewart at efficios.com> wrote:
>
>     Hi Damien,
>
>     On 2024-09-11 18:38, Damien Berget via lttng-dev wrote:
>     > Good day,
>     > We are trying to see what is the best way to monitor some
>     > applications not hitting a deadline. Ideally something like a
>     > watchdog that needs to be patted regularly and, if the timeout is
>     > reached, triggers the snapshot.
>     >
>     > Before we reinvent the wheel and code some userland applications,
>     > is there a canonical way in LTTng to do it? I found this
>     > <https://review.lttng.org/c/lttng-tools/+/9657/9> that is
>     > suspiciously close maybe?
>     >
>     I don't think the proposed changes you linked to are useful or
>     related to what you hope to achieve. The patch series is a concept
>     about how some types of UST ring buffer stalls might be addressed by
>     the session daemon. After a quick glance, the monitoring seems to be
>     more closely related to the 'monitor timer', which is used to sample
>     statistical information about channels[1].
>
>
>     There is a concept of triggers[2]; however triggers react to the
>     presence of events rather than the absence thereof.
>
>
>     I think a small user space application that monitors the state of
>     other applications is more the direction to head in. There are at
>     least a couple of ways that a snapshot on unhealthy state could be
>     achieved:
>
>
>     * Use liblttng-ctl to trigger a snapshot from your watchdog
>     application[3][4].
>
>     * Have the watchdog application exec `lttng snapshot record`[5].
>
>     * Have the watchdog application emit some sort of "health state"
>     events with some data (e.g. health_okay, health_bad, ...) per your
>     usage requirements, and configure a trigger[2] to take a snapshot on
>     the "health state" events that have the non-okay state.
>
>
>     Depending on your tracing configuration (channel overwrite/discard
>     mode[6], buffer sizes, blocking mode, and number of events), it is
>     possible that events may not be recorded. I would favour using
>     liblttng-ctl or exec'ing `lttng snapshot record` if you want a
>     stronger guarantee that your watchdog will cause a snapshot to be
>     taken.
>
>
>     I would love to hear if there are other ideas. Regardless, hope
>     this helps!
>
>
>     thanks,
>
>     kienan
>
>
>     [1]: https://lttng.org/docs/v2.13/#doc-channel-timers
>
>     [2]: https://lttng.org/docs/v2.13/#doc-trigger
>
>     [3]: https://lttng.org/docs/v2.13/#doc-liblttng-ctl-lttng
>
>     [4]: https://github.com/lttng/lttng-tools/tree/master/src/lib/lttng-ctl
>
>     [5]: https://lttng.org/man/1/lttng-snapshot/v2.13/
>
>     [6]: https://lttng.org/docs/v2.13/#doc-channel-overwrite-mode-vs-discard-mode
>
>
>     > Thanks,
>     > Cheers
>     >
>     > --
>     > *Damien Berget*
>     > Embedded Platform Lead
>     > damien.berget at flyzipline.com
>     >
>     > _______________________________________________
>     > lttng-dev mailing list
>     > lttng-dev at lists.lttng.org
>     > https://lists.lttng.org/cgi-bin/mailman/listinfo/lttng-dev
>
>
>
> -- 
> *Damien Berget*