[lttng-dev] Re: lttng-consumerd can NOT get notification to consume ring buffer data
zhenyu.ren
zhenyu.ren at aliyun.com
Tue Dec 7 20:47:01 EST 2021
Hi, Desnoyers
For various reasons, none of the three suggestions you made for reproducing the problem can be carried out easily.
I am also wondering why the commit counter was not updated. It seems that some span producers did not complete their commit (were they killed? did they hang?). It is very hard to debug all the span producers, since there are so many of them...
So, can I make lttng-consumerd simply skip this sub-buffer and try to consume the next one?
Thanks
zhenyu.ren
------------------------------------------------------------------
From: Mathieu Desnoyers <mathieu.desnoyers at efficios.com>
Sent: Wednesday, December 8, 2021, 04:17
To: zhenyu.ren <zhenyu.ren at aliyun.com>
Cc: lttng-dev <lttng-dev at lists.lttng.org>
Subject: Re: [lttng-dev] lttng-consumerd can NOT get notification to consume ring buffer data
Hi,
Can you try:
- Reproducing with a more recent LTTng-UST/LTTng-Tools (e.g. 2.13)? LTTng 2.7 is not supported anymore.
- Reproducing with "per-pid" UST buffers rather than "per-uid"?
- Reproducing without the "tracefile rotation" mode (without the --tracefile-count and --tracefile-size options on the channel)? A sketch of such a setup follows below.
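For reference, here is a minimal sketch of a session matching the second and third suggestions (per-PID UST buffers, no tracefile rotation), assuming a recent lttng-tools CLI (e.g. 2.13). The session name, output path, and event selection are placeholders; the sub-buffer layout and timers simply reuse the values from the channel attributes quoted further down:

lttng create repro-session --output=/tmp/repro-traces
lttng enable-channel --userspace --buffers-pid channel0 \
    --subbuf-size=1M --num-subbuf=16 \
    --switch-timer=20000000 --read-timer=60000000
# Note: no --tracefile-size / --tracefile-count, so tracefile rotation stays disabled.
lttng enable-event --userspace --all --channel=channel0
lttng start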
Also, do you happen to have traced applications which are frequently killed asynchronously while some of their threads are actively tracing?
Thanks,
Mathieu
----- On Dec 7, 2021, at 4:03 AM, lttng-dev <lttng-dev at lists.lttng.org> wrote:
Hi, Lttng-dev:
I recently ran into a strange problem involving lttng-consumerd and the CTF files. The CTF files belonging to some CPUs stopped rotating, while the CTF files belonging to the other CPUs kept working as usual. I am quite sure all CPUs were producing spans the whole time.
#date; ls -ltrh channel0_0_* |tail -n 3; date;ls -ltrh channel0_1_* |tail -n 3
Tue Dec 7 16:25:45 CST 2021
-rw-rw---- 1 root root 1.8M Dec 7 16:20 channel0_0_17
-rw-rw---- 1 root root 2.0M Dec 7 16:23 channel0_0_18
-rw-rw---- 1 root root 916K Dec 7 16:24 channel0_0_19
Tue Dec 7 16:25:45 CST 2021
-rw-rw---- 1 root root 1.7M Dec 6 00:30 channel0_1_8
-rw-rw---- 1 root root 1.9M Dec 6 00:31 channel0_1_9
-rw-rw---- 1 root root 388K Dec 6 00:32 channel0_1_10
Notice that the CTF file for CPU 0 (channel0_0_19) was modified just now, while the CTF file for CPU 1 (channel0_1_10) stopped being updated on Dec 6.
I attached gdb to lttng-consumerd, breaking at lib_ring_buffer_channel_do_read() and lib_ring_buffer_poll_deliver() (I configured a read timer on the channels). The following is the annotated disassembly around lib_ring_buffer_poll_deliver():
0x00007f0f2418a445 <+213>: mov 0x0(%r13),%rcx rcx:0x7f0f0402e610(handle->table)
0x00007f0f2418a449 <+217>: mov 0x98(%rsi),%r9 rsi:0x7f0efb86f000(buf) r9:0x86c400000(consumed_old = buf->consumed)
0x00007f0f2418a450 <+224>: lea 0x150(%rsi),%rdi rdi:0x7f0efda75150
0x00007f0f2418a457 <+231>: mov 0x150(%rsi),%rax rax:3(cpu?)
0x00007f0f2418a45e <+238>: mov 0x78(%rbp),%rdx rbp:chan rdx:0x1000000(chan->backend.buf_size)
0x00007f0f2418a462 <+242>: mov 0x88(%rbp),%r8d r8d:0x14(20)(chan->backend.buf_size_order)
0x00007f0f2418a469 <+249>: cmp %rax,0x8(%rcx)
0x00007f0f2418a46d <+253>: jbe 0x7f0f2418a80c <lib_ring_buffer_channel_do_read+1180>
0x00007f0f2418a473 <+259>: shl $0x6,%rax rax:192
0x00007f0f2418a477 <+263>: sub $0x1,%rdx rdx:0xffffff 16777215 (chan->backend.buf_size - 1)
0x00007f0f2418a47b <+267>: lea 0x10(%rax,%rcx,1),%r10 rax:192 rcx:0x7f0f0402e610(handle->table) r10:0x7f0f0402e6e0
0x00007f0f2418a480 <+272>: and %r9,%rdx r9:0x8fdc00000 rdx:0xc00000 12582912 buf_trunc(consume_old, chan->backend.buf_size - 1)
0x00007f0f2418a483 <+275>: mov 0x8(%rdi),%rax rdi:0x7f0efda75150 rax:2688 ref_offset = (size_t) ref->offset
0x00007f0f2418a487 <+279>: mov %r8d,%ecx ecx:0x14(20)
0x00007f0f2418a48a <+282>: shr %cl,%rdx (....)
0x00007f0f2418a48d <+285>: shl $0x7,%rdx
0x00007f0f2418a491 <+289>: add %rax,%rdx
0x00007f0f2418a494 <+292>: lea 0x80(%rdx),%rax
0x00007f0f2418a49b <+299>: cmp 0x28(%r10),%rax
0x00007f0f2418a49f <+303>: ja 0x7f0f2418a80c <lib_ring_buffer_channel_do_read+1180>
0x00007f0f2418a4a5 <+309>: mov 0x20(%r10),%rax r10:0x7f0f0402e6e0 rax:0x7f0efb86f000(buf)
0x00007f0f2418a4a9 <+313>: add %rdx,%rax rdx:3200 rax:0x7f0efb86fc80
0x00007f0f2418a4ac <+316>: mov (%rax),%rax rax:0x86c00000(commit_count, i.e. cc_sb; incremented _once_, at sub-buffer switch)
0x00007f0f2418a4af <+319>: mov 0x80(%rbp),%r8 r8:0x100000(chan->backend.subbuf_size)
0x00007f0f2418a4b6 <+326>: mov 0x78(%rbp),%rdx rdx:0x1000000(chan->backend.buf_size)
0x00007f0f2418a4ba <+330>: mov 0x8c(%rbp),%ecx ecx:4
0x00007f0f2418a4c0 <+336>: mov 0x80(%rsi),%rdi rdi:0x86d400000
0x00007f0f2418a4c7 <+343>: sub %r8,%rax rax:0x86b00000(commit_count - chan->backend.subbuf_size)
0x00007f0f2418a4ca <+346>: and 0x8(%rbp),%rax rax & chan->commit_count_mask = rax:0x86b00000
0x00007f0f2418a4ce <+350>: neg %rdx rdx:0x1000000(16M chan->backend.buf_size) --> 0xffffffffff000000(-16777216)
0x00007f0f2418a4d1 <+353>: and %r9,%rdx r9:0x86c400000(consume_old) rdx:0x86c000000
0x00007f0f2418a4d4 <+356>: shr %cl,%rdx cl:4 rdx:0x86c00000
0x00007f0f2418a4d7 <+359>: cmp %rdx,%rax rax:0x86b00000 rdx:0x86c00000
0x00007f0f2418a4da <+362>: jne 0x7f0f2418a411 <lib_ring_buffer_channel_do_read+161>
0x00007f0f2418a4e0 <+368>: mov %r8,%rax
209 static inline
210 int lib_ring_buffer_poll_deliver(const struct lttng_ust_lib_ring_buffer_config *config,
211 struct lttng_ust_lib_ring_buffer *buf,
212 struct channel *chan,
213 struct lttng_ust_shm_handle *handle)
214 {
215 unsigned long consumed_old, consumed_idx, commit_count, write_offset;
216
217 consumed_old = uatomic_read(&buf->consumed);
218 consumed_idx = subbuf_index(consumed_old, chan);
219 commit_count = v_read(config, &shmp_index(handle, buf->commit_cold, consumed_idx)->cc_sb);
220 /*
221 * No memory barrier here, since we are only interested
222 * in a statistically correct polling result. The next poll will
223 * get the data if we are racing. The mb() that ensures correct
224 * memory order is in get_subbuf.
225 */
226 write_offset = v_read(config, &buf->offset);
227
228 /*
229 * Check that the subbuffer we are trying to consume has been
230 * already fully committed.
231 */
232
233 if (((commit_count - chan->backend.subbuf_size)
234 & chan->commit_count_mask)
235 - (buf_trunc(consumed_old, chan)
236 >> chan->backend.num_subbuf_order)
237 != 0)
238 return 0;
239
240 /*
241 * Check that we are not about to read the same subbuffer in
242 * which the writer head is.
243 */
244 if (subbuf_trunc(write_offset, chan) - subbuf_trunc(consumed_old, chan)
245 == 0)
246 return 0;
247
248 return 1;
249
250 }
It seems that lib_ring_buffer_channel_do_read() -> lib_ring_buffer_poll_deliver() returned 0 at line 238 ($consumed_old: 0x86c400000, $commit_count: 0x86c00000, $write_offset: 0x86d400000).
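Plugging the observed values into the check at lines 233-237 confirms why it returns 0. A quick sanity check with shell arithmetic, using buf_size = 0x1000000, subbuf_size = 0x100000 and num_subbuf_order = 4 from the gdb annotations above (the AND with chan->commit_count_mask leaves the value unchanged here, see the instruction at +346):

# (commit_count - chan->backend.subbuf_size)
printf '0x%x\n' $(( 0x86c00000 - 0x100000 ))                  # -> 0x86b00000
# buf_trunc(consumed_old, chan) >> chan->backend.num_subbuf_order
printf '0x%x\n' $(( (0x86c400000 & ~(0x1000000 - 1)) >> 4 ))  # -> 0x86c00000

The two results differ, so lib_ring_buffer_poll_deliver() returns 0 at line 238. The observed cc_sb (0x86c00000) is exactly one subbuf_size (0x100000) short of the value (0x86d00000) that would make the check pass, which is consistent with the final commit of that sub-buffer never having completed.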
Buffer type: per UID
Channels:
-------------
- channel0: [enabled]
Attributes:
overwrite mode: 0
subbufers size: 1048576
number of subbufers: 16
switch timer interval: 20000000
read timer interval: 60000000
trace file count: 80
trace file size (bytes): 2097152
output: mmap()
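For completeness, these attributes roughly correspond to a channel configured along the following lines. This is a hedged reconstruction, assuming the lttng 2.7 CLI option names; overwrite mode 0 is the default discard mode, so no flag is shown for it:

lttng enable-channel --userspace --buffers-uid channel0 \
    --subbuf-size=1048576 --num-subbuf=16 \
    --switch-timer=20000000 --read-timer=60000000 \
    --tracefile-size=2097152 --tracefile-count=80 \
    --output=mmap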
PS: The LTTng version is 2.7 (I know it is a very old version :) ).
Do you have any ideas about this problem? Thanks in advance.
Thanks a lot
zhenyu.ren
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com