[ltt-dev] [RFC PATCH] Propose a new kernel tracing framework
Mathieu Desnoyers
mathieu.desnoyers at polymtl.ca
Fri Mar 13 19:02:32 EDT 2009
* Mathieu Desnoyers (mathieu.desnoyers at polymtl.ca) wrote:
> Hi Jiaying,
>
> I'm very interested to hear about the details of your test-case, because
> what you are talking about here, and the numbers that you get, are, at
> the very least, not representative of the current LTTng. Perharps you
> used a version right after the kernel summit and stopped following the
> improvements we have done since then ? I also wonder about the instrumentation
> set you enabled in your tests, and whether it's exactly the same in the ktrace
> and LTTng runs (especially system call tracing). Please note that following the
> Kernel Summit 2008, a more active development cycle began, and the time was not
> right for optimisation. But since then a lot of improvement has been done on
> this side.
>
> First I'd like to express my concern about the way you posted your
> numbers :
>
> 1 - You did not CC anyone from the LTTng project.
> 2 - You did not tell which LTTng version you used as a comparison basis.
> 3 - You did not look at the current state of code before posting your
> comparison. (as shows the comment in your patch header)
> 4 - You did not tell the instrumentation set that you enabled from LTTng. Is it
> the default instrumentation ? Did you also enable system call tracing
> with your tracer using set_kernel_trace_flag_all_tasks() ? If not, this
> makes a huge difference on the amount of information gathered.
>
> So let's have a close look at them. I'm re-doing the benchmarks based on
> the current LTTng version below.
>
> > Hello All,
> >
> > We have been working on building a kernel tracer for a production environment.
> > We first considered to use the existing markers code, but found it a bit
> > heavyweight.
>
> Have you looked at :
>
> 1 - The tracepoints currently in the kernel
> 2 - The _current_ version of markers in the LTTng tree ?
>
> > So we developed a new kernel tracing prototype that shares
> > many similarities as markers but also includes several simplifications that
> > give us considerable performance benefits. For example, we pass the size of
> > trace event directly to the probe function instead of using va_arg as that
> > used in markers, which saves us the overhead of variable passing.
>
> The markers in the LTTng tree only use va_args for "debug-style"
> tracing. All the tracing fast paths are now done with custom probe
> functions which takes the arguments from the tracepoint and write them
> in the trace buffers directly in C. When the event payload size is already
> known statically, there is no dynamic computation cost at all : it's
> simply passed to the ltt-relay write function.
>
> > Also, we
> > associate a single probe function to each event, so we can directly call the
> > probe function when tracing is enabled.
>
> You seem to imply that being stuck with a single probe function is
> somehow superior to have tiny per-event callbacks which then call into a
> buffer write function ? Why ?
>
> > Our measurements show that our
> > current tracing prototype introduces much lower overhead compared with
> > lttng (we use lttng as a comparison because markers does not provide the
> > implementation to record trace data).
>
> Why LTTng version, kernel version, LTTng config and runtime options do
> you use ?
>
> > E.g., with dbench, lttng introduces
> > 20~80% overhead on our testing platforms while the overhead with our
> > kernel tracer is within 10%; with tbench, lttng usually adds more than 30%
> > overhead while the overhead added by our kernel tracer is usually within
> > 15%; with kernbench (time to compile a Linux kernel), the overhead
> > added by lttng is around 4%~18% while the overhead added by our
> > kernel tracer is within 2.5%.
> >
>
> I'm re-running the dbench and tbench tests :
>
> With LTTng 0.106, kernel 2.6.29-rc7
> 8-cores Intel(R) Xeon(R) CPU E5405 @ 2.00GHz, 16GB ram
>
> Arming all the default LTTng instrumentation (168 tracepoints),
> Tracing in flight recorder mode, including system call entry/exit
> tracing and block I/O tracing (blktrace instrumentation), I get :
>
> Test / config | Vanilla kernel | LTTng 0.106 | slowdown
> dbench 8 -t 200 | 1361 MB/s | 1193 MB/s | 12.3 %
>
> As an example of the effect of custom probe vs va_args based probes,
> let's take the tbench example
>
> | Vanilla | LTTng 0.106 | slowdown
> tbench 8 -t 200 | 2074 MB/s | 997 MB/s | 52 %
>
> The LTTng trace statistics shows the top high event rate events :
>
> Some are usual (therefore lttng already implements custom probes for them) :
>
> 42004 sched_try_wakeup
> 79754 timer_set
> 83973 sched_schedule
> 84317 softirq_exit
> 84318 softirq_entry
> 95370 page_free
> 95412 page_alloc
> 267247 syscall_entry
> 267247 syscall_exit
>
> Some are more tbench workload-specific :
>
> 16915 socket_sendmsg
> 18911 napi_complete
> 18911 napi_schedule
> 18912 dev_receive
> 18914 dev_xmit
> 18920 napi_poll
> 33831 socket_recvmsg
> 59860 read
>
>
> So let's see what happens if I implement those as custom probes and do a bit of
> tuning in the tracer hot path :
>
> | Vanilla | LTTng 0.107 | slowdown
> Custom probes :
> tbench 8 -t 200 | 2074 MB/s | 1275 MB/s | 38.5 %
>
> With tracer hot path tuning (inline hot path, function calls for slow path, will
> be in 0.108) :
> tbench 8 -t 200 | 2074 MB/s | 1335 MB/s | 35.6 %
>
After fixing false sharing problems found using oprofile (got burned by
kzalloc_node, thinking it was somehow cacheline aligned, but is not), updated
tbench performance results (this will be in lttng 0.108) :
Vanilla : 2070 MB/sec
1405 MB/sec 32 % slowdown
Removing the rdtsc barriers (ouch, the kernel uses mfence and lfence in the
recent x86 implementations). The question is : do we keep them or not, or is a
sync_core() (cpuid, synchronizing the core) enough ?
1450 MB/sec 29 % slowdown
Some of the most active instructions in the tracer fast path :
....
2982 ffffffff803fefc2 3.0242
rcu list of active traces prefetch0
5000 ffffffff803ff379 5.0708
xadd (commit)
13525 ffffffff803ff04f 13.7166
rdtsc
Basically, simply using oprofile opreport -i on _ltt_specialized_trace
and sorting by number of samples highlights the fastpath hot spots.
Mathieu
> kernbench :
> Average Optimal load -j 32 Run
>
> Vanilla 2.6.29-rc7 :
> Elapsed Time 48.238 (0.117346)
>
> With flight recorder tracing, default lttng instrumentation, lttng 0.108 :
> Elapsed Time 50.226 (0.0750333)
>
> Slowdown : 4.1 %
>
> So except the tbench workload on LTTng 0.106, which still had a 52% performance
> impact (brought it down to 35 % for the next LTTng release) due to
> specific workload and not having specialized probes implemented, the
> dbench impact is well below your results (12.3 % vs 20-80%)
>
> So given the huge result discrepancies between your tests and mine, could you
> give us more detail about your benchmarks ? Especially knowing the enabled
> instrumentation set, and the amount of MB/s generated by both ktrace and LTTng
> runs (to make sure we are comparing the same traffic) and the LTTng versions you
> are using as a comparison basis would be good.
>
> Thanks,
>
> Mathieu
>
>
> > The patch below is the current prototype we have implemented. It uses
> > Steve's unified trace buffer code with some small extensions to support
> > buffer mmap. The patch is probably still buggy, but we would like to get
> > some early feedback from the community.
> >
> > Signed-off-by: Jiaying Zhang <jiayingz at google.com>
> >
> > Index: git-linux/init/Kconfig
> > ===================================================================
> > --- git-linux.orig/init/Kconfig 2009-02-19 14:58:37.000000000 -0800
> > +++ git-linux/init/Kconfig 2009-02-19 15:50:15.000000000 -0800
> > @@ -950,6 +950,15 @@
> > Place an empty function call at each marker site. Can be
> > dynamically changed for a probe function.
> >
> > +config KTRACE
> > + bool "Enable ktrace support"
> > + select DEBUG_FS
> > + select TRACING
> > + select RING_BUFFER
> > + help
> > + Ktrace is a kernel tracing tool that allows you to trace
> > + kernel events by inserting trace points at proper places.
> > +
> > source "arch/Kconfig"
> >
> > endmenu # General setup
> > Index: git-linux/kernel/Makefile
> > ===================================================================
> > --- git-linux.orig/kernel/Makefile 2009-02-19 14:58:37.000000000 -0800
> > +++ git-linux/kernel/Makefile 2009-02-19 14:58:39.000000000 -0800
> > @@ -86,6 +86,7 @@
> > obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
> > obj-$(CONFIG_TASKSTATS) += taskstats.o tsacct.o
> > obj-$(CONFIG_MARKERS) += marker.o
> > +obj-$(CONFIG_KTRACE) += ktrace.o
> > obj-$(CONFIG_TRACEPOINTS) += tracepoint.o
> > obj-$(CONFIG_LATENCYTOP) += latencytop.o
> > obj-$(CONFIG_HAVE_GENERIC_DMA_COHERENT) += dma-coherent.o
> > Index: git-linux/include/asm-generic/vmlinux.lds.h
> > ===================================================================
> > --- git-linux.orig/include/asm-generic/vmlinux.lds.h 2009-02-19
> > 14:58:37.000000000 -0800
> > +++ git-linux/include/asm-generic/vmlinux.lds.h 2009-02-19
> > 14:58:39.000000000 -0800
> > @@ -73,6 +73,9 @@
> > MEM_KEEP(init.data) \
> > MEM_KEEP(exit.data) \
> > . = ALIGN(8); \
> > + VMLINUX_SYMBOL(__start___ktraces) = .; \
> > + *(__ktrace) \
> > + VMLINUX_SYMBOL(__stop___ktraces) = .; \
> > VMLINUX_SYMBOL(__start___markers) = .; \
> > *(__markers) \
> > VMLINUX_SYMBOL(__stop___markers) = .; \
> > @@ -89,6 +92,7 @@
> > VMLINUX_SYMBOL(__start_rodata) = .; \
> > *(.rodata) *(.rodata.*) \
> > *(__vermagic) /* Kernel version magic */ \
> > + *(__ktrace_strings) /* Ktrace: strings */ \
> > *(__markers_strings) /* Markers: strings */ \
> > *(__tracepoints_strings)/* Tracepoints: strings */ \
> > } \
> > Index: git-linux/kernel/module.c
> > ===================================================================
> > --- git-linux.orig/kernel/module.c 2009-02-19 14:58:37.000000000 -0800
> > +++ git-linux/kernel/module.c 2009-02-19 14:58:39.000000000 -0800
> > @@ -50,6 +50,7 @@
> > #include <asm/sections.h>
> > #include <linux/tracepoint.h>
> > #include <linux/ftrace.h>
> > +#include <linux/ktrace.h>
> > #include <linux/async.h>
> >
> > #if 0
> > @@ -2145,6 +2146,10 @@
> > sizeof(*mod->tracepoints),
> > &mod->num_tracepoints);
> > #endif
> > +#ifdef CONFIG_KTRACE
> > + mod->ktrace = section_objs(hdr, sechdrs, secstrings, "__ktrace",
> > + sizeof(*mod->ktrace), &mod->num_ktrace);
> > +#endif
> >
> > #ifdef CONFIG_MODVERSIONS
> > if ((mod->num_syms && !mod->crcs)
> > Index: git-linux/include/linux/module.h
> > ===================================================================
> > --- git-linux.orig/include/linux/module.h 2009-02-19
> > 14:58:37.000000000 -0800
> > +++ git-linux/include/linux/module.h 2009-02-19 14:58:39.000000000 -0800
> > @@ -44,6 +44,7 @@
> > };
> >
> > struct module;
> > +struct kernel_trace;
> >
> > struct module_attribute {
> > struct attribute attr;
> > @@ -347,6 +348,11 @@
> > /* Reference counts */
> > struct module_ref ref[NR_CPUS];
> > #endif
> > +#ifdef CONFIG_KTRACE
> > + struct kernel_trace *ktrace;
> > + unsigned int num_ktrace;
> > +#endif
> > +
> > };
> > #ifndef MODULE_ARCH_INIT
> > #define MODULE_ARCH_INIT {}
> > Index: git-linux/include/linux/ktrace.h
> > ===================================================================
> > --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> > +++ git-linux/include/linux/ktrace.h 2009-02-19 15:36:39.000000000 -0800
> > @@ -0,0 +1,83 @@
> > +#ifndef _LINUX_KTRACE_H
> > +#define _LINUX_KTRACE_H
> > +
> > +#include <linux/types.h>
> > +
> > +struct kernel_trace;
> > +
> > +typedef void ktrace_probe_func(struct kernel_trace *, void *, size_t);
> > +
> > +struct kernel_trace {
> > + const char *name;
> > + const char *format; /* format string describing variable list */
> > + size_t *stroff; /* offsets of string variables */
> > + /* 31 bit event_id is converted to 16 bit when entering to the buffer */
> > + u32 enabled:1, event_id:31;
> > + ktrace_probe_func *func; /* probe function */
> > + struct list_head list; /* list head linked to the hash table entry */
> > +} __attribute__((aligned(8)));
> > +
> > +extern int ktrace_enabled;
> > +
> > +/*
> > + * Make sure the alignment of the structure in the __ktrace section will
> > + * not add unwanted padding between the beginning of the section and the
> > + * structure. Force alignment to the same alignment as the section start.
> > + */
> > +
> > +#define DEFINE_KTRACE_STRUCT(name) \
> > + struct __attribute__((packed)) ktrace_struct_##name
> > +
> > +#ifdef CONFIG_KTRACE
> > +/*
> > + * DO_TRACE is the macro that is called from each trace point. Each trace
> > + * point is associated with a static ktrace object that is located in a
> > + * special __ktrace section and is uniquely identified by the event name and
> > + * event format. Take a look at the files under include/trace for example
> > + * usage. Note that we call the trace probe function only if the global
> > + * tracing is enabled and the tracing for the associated event is enabled,
> > + * so we only introduce the overhead of a predicted condition judge when
> > + * tracing is disabled.
> > + */
> > +#define DO_TRACE(name, format, args...)
> > \
> > + do { \
> > + static const char __kstrtab_##name[] \
> > + __attribute__((section("__ktrace_strings"))) \
> > + = #name "\0" format; \
> > + static struct kernel_trace __ktrace_##name \
> > + __attribute__((section("__ktrace"), aligned(8))) = \
> > + { __kstrtab_##name, &__kstrtab_##name[sizeof(#name)], \
> > + NULL, 0, 0, NULL, LIST_HEAD_INIT(__ktrace_##name.list) }; \
> > + __ktrace_check_format(format, ## args); \
> > + if (unlikely(ktrace_enabled) && \
> > + unlikely(__ktrace_##name.enabled)) { \
> > + struct ktrace_struct_##name karg = { args }; \
> > + (*__ktrace_##name.func) \
> > + (&__ktrace_##name, &karg, sizeof(karg)); \
> > + } \
> > + } while (0)
> > +
> > +#else /* !CONFIG_KTRACE */
> > +#define DO_TRACE(name, format, args...)
> > +#endif
> > +
> > +/* To be used for string format validity checking with gcc */
> > +static inline void __printf(1, 2) ___ktrace_check_format(const char *fmt, ...)
> > +{
> > +}
> > +
> > +#define __ktrace_check_format(format, args...) \
> > + do { \
> > + if (0) \
> > + ___ktrace_check_format(format, ## args); \
> > + } while (0)
> > +
> > +
> > +/* get the trace buffer information */
> > +#define KTRACE_BUF_GET_SIZE _IOR(0xF5, 0x00, __u32)
> > +#define KTRACE_BUF_GET_PRODUCED _IOR(0xF5, 0x01, __u32)
> > +#define KTRACE_BUF_GET_CONSUMED _IOR(0xF5, 0x02, __u32)
> > +#define KTRACE_BUF_PUT_PRODUCED _IOW(0xF5, 0x03, __u32)
> > +#define KTRACE_BUF_PUT_CONSUMED _IOW(0xF5, 0x04, __u32)
> > +
> > +#endif
> > Index: git-linux/kernel/ktrace.c
> > ===================================================================
> > --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> > +++ git-linux/kernel/ktrace.c 2009-02-19 15:45:20.000000000 -0800
> > @@ -0,0 +1,872 @@
> > +/*
> > + * kernel/ktrace.c
> > + *
> > + * Implementation of the kernel tracing for linux kernel 2.6.
> > + *
> > + * Copyright 2008- Google Inc.
> > + * Original Author: Jiaying Zhang
> > + */
> > +
> > +#include <linux/module.h>
> > +#include <linux/kernel.h>
> > +#include <linux/bitmap.h>
> > +#include <linux/poll.h>
> > +#include <linux/errno.h>
> > +#include <linux/sched.h>
> > +#include <linux/linkage.h>
> > +#include <linux/debugfs.h>
> > +#include <linux/timer.h>
> > +#include <linux/delay.h>
> > +#include <linux/jhash.h>
> > +#include <linux/ctype.h>
> > +#include <linux/namei.h>
> > +#include <linux/ktrace.h>
> > +#include <linux/ring_buffer.h>
> > +
> > +static char *ktrace_version = "1.0.0";
> > +int ktrace_enabled;
> > +EXPORT_SYMBOL(ktrace_enabled);
> > +static uint16_t ktrace_next_id;
> > +
> > +extern struct kernel_trace __start___ktraces[];
> > +extern struct kernel_trace __stop___ktraces[];
> > +
> > +static DEFINE_MUTEX(ktrace_mutex);
> > +
> > +/* ktrace hash table entry structure and variables */
> > +struct ktrace_entry {
> > + struct hlist_node hlist;
> > + char *name;
> > + char *format;
> > + size_t *stroff;
> > + ktrace_probe_func *func;
> > + u32 enabled:1, event_id:31;
> > + struct list_head klist; /* list of loaded ktraces */
> > +};
> > +
> > +#define KTRACE_HASH_BITS 6
> > +#define KTRACE_HASH_SIZE (1 << KTRACE_HASH_BITS)
> > +static struct hlist_head ktrace_table[KTRACE_HASH_SIZE];
> > +
> > +/* debugfs directory variables */
> > +static struct dentry *tracedir;
> > +static struct dentry *eventdir;
> > +typedef void tracecontrol_handle_func(struct file *, unsigned long, size_t *);
> > +
> > +/* ring buffer code */
> > +static unsigned long trace_buf_size = 65536UL;
> > +struct ring_buffer *trace_buffer;
> > +static DECLARE_WAIT_QUEUE_HEAD(trace_wait); /* waitqueue for trace_data_poll */
> > +static struct timer_list trace_timer; /* reader wake-up timer */
> > +#define TRACEREAD_WAKEUP_INTERVAL 1000 /* time interval in jiffies */
> > +static struct kref ktrace_kref;
> > +
> > +static void
> > +ring_buffer_probe(struct kernel_trace *kt, void *data, size_t event_size)
> > +{
> > + struct ring_buffer_event *event;
> > + void *trace_event;
> > + unsigned long flag;
> > +
> > + event = ring_buffer_lock_reserve(trace_buffer,
> > + sizeof(uint16_t) + event_size, &flag);
> > + if (!event)
> > + return;
> > + trace_event = ring_buffer_event_data(event);
> > + *(uint16_t *) trace_event = (uint16_t) kt->event_id;
> > + memcpy(trace_event + sizeof(uint16_t), data, event_size);
> > + ring_buffer_unlock_commit(trace_buffer, event, flag);
> > +}
> > +
> > +/* special probe function for events that contain string arguments */
> > +static void
> > +string_probe(struct kernel_trace *kt, void *data, size_t event_size)
> > +{
> > + struct ring_buffer_event *event;
> > + void *trace_event, *p;
> > + unsigned long flag;
> > + size_t *offset, scanned;
> > + char *string;
> > +
> > + /*
> > + * Get the real length of the event, i.e., we use stroff array to
> > + * locate string variables in the passed-in trace data and use their
> > + * length to replace the size of the string pointers.
> > + */
> > + for (offset = kt->stroff; *offset != -1; offset++) {
> > + string = *(char **) (data + *offset);
> > + event_size += strlen(string) + 1 - sizeof(char *);
> > + }
> > + event = ring_buffer_lock_reserve(trace_buffer,
> > + sizeof(uint16_t) + event_size, &flag);
> > + if (!event)
> > + return;
> > + trace_event = ring_buffer_event_data(event);
> > + *(uint16_t *) trace_event = (uint16_t) kt->event_id;
> > + p = trace_event + sizeof(uint16_t);
> > + /*
> > + * Copy the trace data into buffer. For string variables, we enter the
> > + * string into the buffer. Otherwise, the passed in data is copied.
> > + */
> > + for (offset = kt->stroff, scanned = 0; *offset != -1; offset++) {
> > + memcpy(p, data + scanned, *offset - scanned);
> > + p += *offset - scanned;
> > + string = *(char **) (data + *offset);
> > + memcpy(p, string, strlen(string) + 1);
> > + p += strlen(string) + 1;
> > + scanned = *offset + sizeof(char *);
> > + }
> > + memcpy(p, data + scanned,
> > + trace_event + sizeof(uint16_t) + event_size - p);
> > + ring_buffer_unlock_commit(trace_buffer, event, flag);
> > +}
> > +
> > +/* timer function used to defer ring buffer reader waking */
> > +static void wakeup_readers(unsigned long data)
> > +{
> > + if (trace_buffer && !ring_buffer_empty(trace_buffer))
> > + wake_up_interruptible(&trace_wait);
> > + mod_timer(&trace_timer, jiffies + TRACEREAD_WAKEUP_INTERVAL);
> > +}
> > +
> > +/* function for reading ktrace metadata info */
> > +static void *tracing_info_start(struct seq_file *seq, loff_t *pos)
> > +{
> > + seq_printf(seq, "version %s\n", ktrace_version);
> > + return (*pos >= KTRACE_HASH_SIZE) ? NULL : &ktrace_table[*pos];
> > +}
> > +
> > +static void *tracing_info_next(struct seq_file *seq, void *v, loff_t *pos)
> > +{
> > + ++*pos;
> > + return (*pos >= KTRACE_HASH_SIZE) ? NULL : &ktrace_table[*pos];
> > +}
> > +
> > +static int tracing_info_show(struct seq_file *seq, void *v)
> > +{
> > + struct hlist_head *head = v;
> > + struct hlist_node *node;
> > + struct ktrace_entry *entry;
> > + hlist_for_each_entry(entry, node, head, hlist)
> > + seq_printf(seq, "name '%s' format '%s' id %u %s\n",
> > + entry->name, entry->format, entry->event_id,
> > + entry->enabled ? "enabled" : "disabled");
> > + return 0;
> > +}
> > +
> > +static void tracing_info_stop(struct seq_file *seq, void *v)
> > +{
> > +}
> > +
> > +static const struct seq_operations tracing_info_seq_ops = {
> > + .start = tracing_info_start,
> > + .next = tracing_info_next,
> > + .show = tracing_info_show,
> > + .stop = tracing_info_stop,
> > +};
> > +
> > +static int tracing_info_open(struct inode *inode, struct file *file)
> > +{
> > + return seq_open(file, &tracing_info_seq_ops);
> > +}
> > +
> > +static const struct file_operations traceinfo_file_operations = {
> > + .open = tracing_info_open,
> > + .read = seq_read,
> > + .llseek = seq_lseek,
> > + .release = seq_release,
> > +};
> > +
> > +/*
> > + * wrapper function used by debugfs write operation.
> > + * func: handling function that does real work
> > + */
> > +static ssize_t trace_debugfs_write(struct file *filp, const char __user *ubuf,
> > + size_t cnt, loff_t *ppos, tracecontrol_handle_func *func)
> > +{
> > + int ret;
> > + char buf[64];
> > + unsigned long val;
> > +
> > + if (cnt >= sizeof(buf))
> > + return -EINVAL;
> > + if (copy_from_user(&buf, ubuf, cnt))
> > + return -EFAULT;
> > + buf[cnt] = 0;
> > +
> > + ret = strict_strtoul(buf, 10, &val);
> > + if (ret < 0)
> > + return ret;
> > + val = !!val;
> > +
> > + func(filp, val, &cnt);
> > +
> > + filp->f_pos += cnt;
> > + return cnt;
> > +}
> > +
> > +/* functions for reading/writing the global 'enabled' control file */
> > +static ssize_t tracing_control_read(struct file *filp, char __user *ubuf,
> > + size_t cnt, loff_t *ppos)
> > +{
> > + int ret;
> > + char buf[64];
> > + ret = snprintf(buf, 64, "%u\n", ktrace_enabled);
> > + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
> > +}
> > +
> > +static void __tracing_control_write(struct file *filp,
> > + unsigned long val, size_t *cnt)
> > +{
> > + if (val ^ ktrace_enabled) {
> > + if (val) {
> > + trace_timer.expires =
> > + jiffies + TRACEREAD_WAKEUP_INTERVAL;
> > + add_timer(&trace_timer);
> > + ktrace_enabled = 1;
> > + } else {
> > + ktrace_enabled = 0;
> > + del_timer_sync(&trace_timer);
> > + if (trace_buffer && !ring_buffer_empty(trace_buffer))
> > + wake_up_interruptible(&trace_wait);
> > + }
> > + }
> > +}
> > +
> > +static ssize_t tracing_control_write(struct file *filp, const char
> > __user *ubuf,
> > + size_t cnt, loff_t *ppos)
> > +{
> > + return trace_debugfs_write(filp, ubuf, cnt, ppos,
> > + __tracing_control_write);
> > +}
> > +
> > +static const struct file_operations tracecontrol_file_operations = {
> > + .read = tracing_control_read,
> > + .write = tracing_control_write,
> > +};
> > +
> > +/* functions for reading/writing a trace event file */
> > +static int update_ktrace(struct ktrace_entry *entry,
> > + ktrace_probe_func *func, int enabled)
> > +{
> > + struct kernel_trace *iter;
> > + /* no need to update the list if the tracing is not initialized */
> > + if (!tracedir)
> > + return 0;
> > + entry->enabled = enabled;
> > + entry->func = func;
> > + list_for_each_entry(iter, &entry->klist, list) {
> > + iter->stroff = entry->stroff;
> > + iter->func = func;
> > + iter->enabled = enabled;
> > + }
> > + return 0;
> > +}
> > +
> > +static int ktrace_probe_register(struct ktrace_entry *entry,
> > + ktrace_probe_func *probefunc)
> > +{
> > + return update_ktrace(entry, probefunc, 1);
> > +}
> > +
> > +static int ktrace_probe_unregister(struct ktrace_entry *entry)
> > +{
> > + return update_ktrace(entry, NULL, 0);
> > +}
> > +
> > +static int ktrace_control_open(struct inode *inode, struct file *filp)
> > +{
> > + filp->private_data = inode->i_private;
> > + return 0;
> > +}
> > +
> > +static ssize_t ktrace_control_read(struct file *filp, char __user *ubuf,
> > + size_t cnt, loff_t *ppos)
> > +{
> > + int ret;
> > + char buf[64];
> > + struct ktrace_entry *entry = filp->private_data;
> > +
> > + ret = snprintf(buf, 64, "%u\n", entry->enabled);
> > + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
> > +}
> > +
> > +/*
> > + * Check whether the event contains string arguments. Return 0 if not;
> > + * otherwise record the offsets of the string arguments in the ktrace
> > + * structure according to the event format and return 1.
> > + * The offset info is later used to store the string arguments into
> > + * the trace buffer during trace probe.
> > + */
> > +static int contain_string_arguments(struct ktrace_entry *entry)
> > +{
> > + int count = 0;
> > + char *fmt;
> > + int qualifier; /* 'h', 'l', or 'L' for integer fields */
> > + size_t offset = 0;
> > +
> > + for (fmt = strstr(entry->format, "%s"); fmt; fmt = strstr(++fmt, "%s"))
> > + count++;
> > + if (!count)
> > + return 0;
> > + entry->stroff = kmalloc(sizeof(size_t) * (count + 1), GFP_KERNEL);
> > + if (!entry->stroff)
> > + return -ENOMEM;
> > +
> > + for (fmt = entry->format, count = 0; *fmt ; ++fmt) {
> > + if (*fmt != '%')
> > + continue;
> > +repeat:
> > + ++fmt;
> > + switch (*fmt) {
> > + case '-':
> > + case '+':
> > + case ' ':
> > + case '#':
> > + case '0':
> > + ++fmt;
> > + goto repeat;
> > + }
> > +
> > + while (isdigit(*fmt))
> > + fmt++;
> > +
> > + /* get the conversion qualifier */
> > + qualifier = -1;
> > + if (*fmt == 'h' || *fmt == 'l' || *fmt == 'L') {
> > + qualifier = *fmt;
> > + ++fmt;
> > + if (qualifier == 'l' && *fmt == 'l') {
> > + qualifier = 'L';
> > + ++fmt;
> > + }
> > + }
> > +
> > + switch (*fmt) {
> > + case 'c':
> > + offset += sizeof(char);
> > + continue;
> > + case 's':
> > + entry->stroff[count] = offset;
> > + count++;
> > + offset += sizeof(char *);
> > + continue;
> > + case 'p':
> > + offset += sizeof(void *);
> > + continue;
> > + case 'd':
> > + case 'i':
> > + case 'o':
> > + case 'u':
> > + case 'x':
> > + case 'X':
> > + switch (qualifier) {
> > + case 'L':
> > + offset += sizeof(long long);
> > + break;
> > + case 'l':
> > + offset += sizeof(long);
> > + break;
> > + case 'h':
> > + offset += sizeof(short);
> > + break;
> > + default:
> > + offset += sizeof(int);
> > + break;
> > + }
> > + break;
> > + default:
> > + printk(KERN_WARNING "unknown format '%c'", *fmt);
> > + continue;
> > + }
> > + }
> > + entry->stroff[count] = -1;
> > + return 1;
> > +}
> > +
> > +static void __ktrace_control_write(struct file *filp,
> > + unsigned long val, size_t *cnt)
> > +{
> > + struct ktrace_entry *entry = filp->private_data;
> > + int ret;
> > +
> > + mutex_lock(&ktrace_mutex);
> > + if (val ^ entry->enabled) {
> > + if (val) {
> > + ret = contain_string_arguments(entry);
> > + if (ret == 0)
> > + ktrace_probe_register(entry, ring_buffer_probe);
> > + else if (ret > 0)
> > + ktrace_probe_register(entry, string_probe);
> > + else
> > + *cnt = ret;
> > + } else
> > + ktrace_probe_unregister(entry);
> > + }
> > + mutex_unlock(&ktrace_mutex);
> > +}
> > +
> > +static ssize_t ktrace_control_write(struct file *filp, const char __user *ubuf,
> > + size_t cnt, loff_t *ppos)
> > +{
> > + return trace_debugfs_write(filp, ubuf, cnt, ppos,
> > + __ktrace_control_write);
> > +}
> > +
> > +static const struct file_operations ktracecontrol_file_operations = {
> > + .open = ktrace_control_open,
> > + .read = ktrace_control_read,
> > + .write = ktrace_control_write,
> > +};
> > +
> > +/* functions for adding/removing a trace event. Protected by mutex lock.
> > + * Called during initialization or after loading a module */
> > +static struct ktrace_entry *add_ktrace(struct kernel_trace *ktrace)
> > +{
> > + struct ktrace_entry *entry;
> > + size_t name_len = strlen(ktrace->name) + 1;
> > + u32 hash = jhash(ktrace->name, name_len-1, 0);
> > + struct hlist_head *head;
> > + struct hlist_node *node;
> > + struct dentry *dentry;
> > +
> > + if (!tracedir)
> > + return 0;
> > + head = &ktrace_table[hash & ((1 << KTRACE_HASH_BITS)-1)];
> > + hlist_for_each_entry(entry, node, head, hlist) {
> > + if (!strcmp(ktrace->name, entry->name)) {
> > + if (strcmp(ktrace->format, entry->format)) {
> > + printk(KERN_WARNING "the format of tracepoint "
> > + "\'%s\' changes from \'%s\' to \'%s\'."
> > + "Dynamic changing of trace format is "
> > + "not supported yet!\n",
> > + ktrace->name, entry->format,
> > + ktrace->format);
> > + return ERR_PTR(-EINVAL);
> > + }
> > + ktrace->enabled = entry->enabled;
> > + ktrace->func = entry->func;
> > + ktrace->event_id = entry->event_id;
> > + if (list_empty(&entry->klist))
> > + goto add_head;
> > + list_add_tail(&ktrace->list, &entry->klist);
> > + return entry;
> > + }
> > + }
> > + entry = kmalloc(sizeof(struct ktrace_entry), GFP_KERNEL);
> > + if (!entry)
> > + return ERR_PTR(-ENOMEM);
> > + entry->name = kmalloc(name_len, GFP_KERNEL);
> > + if (!entry->name) {
> > + kfree(entry);
> > + return ERR_PTR(-ENOMEM);
> > + }
> > + memcpy(entry->name, ktrace->name, name_len);
> > + if (ktrace->format) {
> > + size_t format_len = strlen(ktrace->format) + 1;
> > + entry->format = kmalloc(format_len, GFP_KERNEL);
> > + if (!entry->format) {
> > + kfree(entry->name);
> > + kfree(entry);
> > + return ERR_PTR(-ENOMEM);
> > + }
> > + memcpy(entry->format, ktrace->format, format_len);
> > + } else
> > + entry->format = NULL;
> > + entry->func = ktrace->func;
> > + entry->enabled = 0;
> > + ktrace->event_id = entry->event_id = ktrace_next_id++;
> > + entry->stroff = NULL;
> > + INIT_LIST_HEAD(&entry->klist);
> > + hlist_add_head(&entry->hlist, head);
> > +add_head:
> > + list_add(&ktrace->list, &entry->klist);
> > + dentry = debugfs_create_file(entry->name, 0660, eventdir,
> > + entry, &ktracecontrol_file_operations);
> > + if (!dentry)
> > + printk(KERN_WARNING "Couldn't create debugfs '%s' entry\n",
> > + entry->name);
> > + return entry;
> > +}
> > +
> > +static int remove_ktrace(struct kernel_trace *ktrace, int free)
> > +{
> > + struct ktrace_entry *entry;
> > + size_t name_len = strlen(ktrace->name) + 1;
> > + u32 hash = jhash(ktrace->name, name_len-1, 0);
> > + struct hlist_head *head;
> > + struct hlist_node *node, *temp;
> > + struct dentry *dentry;
> > +
> > + if (!tracedir)
> > + return 0;
> > + list_del(&ktrace->list);
> > + head = &ktrace_table[hash & ((1 << KTRACE_HASH_BITS)-1)];
> > + hlist_for_each_entry_safe(entry, node, temp, head, hlist)
> > + if (!strcmp(ktrace->name, entry->name)) {
> > + if (list_empty(&entry->klist)) {
> > + dentry = lookup_one_len(entry->name,
> > + eventdir, strlen(entry->name));
> > + if (dentry && !IS_ERR(dentry)) {
> > + debugfs_remove(dentry);
> > + dput(dentry);
> > + }
> > + entry->enabled = 0;
> > + if (free) {
> > + hlist_del(&entry->hlist);
> > + kfree(entry->name);
> > + kfree(entry->format);
> > + kfree(entry->stroff);
> > + kfree(entry);
> > + }
> > + }
> > + return 0;
> > + }
> > + return -1;
> > +}
> > +
> > +/* Add/remove tracepoints contained in a module to the ktrace hash table.
> > + * Called at the end of module load/unload. */
> > +static int ktrace_module_notify(struct notifier_block *self,
> > + unsigned long val, void *data)
> > +{
> > +#ifdef CONFIG_MODULES
> > + struct kernel_trace *iter;
> > + struct module *mod = data;
> > +
> > + if (val != MODULE_STATE_COMING && val != MODULE_STATE_GOING)
> > + return 0;
> > + mutex_lock(&ktrace_mutex);
> > + for (iter = mod->ktrace; iter < mod->ktrace + mod->num_ktrace; iter++)
> > + if (val == MODULE_STATE_COMING)
> > + add_ktrace(iter);
> > + else
> > + remove_ktrace(iter, 0);
> > + mutex_unlock(&ktrace_mutex);
> > +#endif
> > + return 0;
> > +}
> > +
> > +static struct notifier_block ktrace_module_nb = {
> > + .notifier_call = ktrace_module_notify,
> > +};
> > +
> > +/* functions for user-space programs to read data from tracing buffer */
> > +static int trace_data_open(struct inode *inode, struct file *filp)
> > +{
> > + filp->private_data = inode->i_private;
> > + kref_get(&ktrace_kref);
> > + return 0;
> > +}
> > +
> > +static void release_trace_buffer(struct kref *kref)
> > +{
> > + ring_buffer_free(trace_buffer);
> > + trace_buffer = NULL;
> > +}
> > +
> > +static int trace_data_release(struct inode *inode, struct file *filp)
> > +{
> > + kref_put(&ktrace_kref, release_trace_buffer);
> > + return 0;
> > +}
> > +
> > +static unsigned int trace_data_poll(struct file *filp, poll_table *poll_table)
> > +{
> > + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
> > + if (filp->f_mode & FMODE_READ) {
> > + if (!ring_buffer_per_cpu_empty(cpu_buffer))
> > + return POLLIN | POLLRDNORM;
> > + poll_wait(filp, &trace_wait, poll_table);
> > + if (!ring_buffer_per_cpu_empty(cpu_buffer))
> > + return POLLIN | POLLRDNORM;
> > + }
> > + return 0;
> > +}
> > +
> > +static int trace_buf_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > +{
> > + struct ring_buffer_per_cpu *cpu_buffer = vma->vm_private_data;
> > + pgoff_t pgoff = vmf->pgoff;
> > + struct page *page;
> > +
> > + if (!trace_buffer)
> > + return VM_FAULT_OOM;
> > + page = ring_buffer_get_page(cpu_buffer, pgoff);
> > + if (page == NULL)
> > + return VM_FAULT_SIGBUS;
> > + get_page(page);
> > + vmf->page = page;
> > + return 0;
> > +}
> > +
> > +static struct vm_operations_struct trace_data_mmap_ops = {
> > + .fault = trace_buf_fault,
> > +};
> > +
> > +static int trace_data_mmap(struct file *filp, struct vm_area_struct *vma)
> > +{
> > + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
> > +
> > + vma->vm_ops = &trace_data_mmap_ops;
> > + vma->vm_flags |= VM_DONTEXPAND;
> > + vma->vm_private_data = cpu_buffer;
> > + return 0;
> > +}
> > +
> > +static int trace_data_ioctl(struct inode *inode, struct file *filp,
> > + unsigned int cmd, unsigned long arg)
> > +{
> > + u32 __user *argp = (u32 __user *)arg;
> > + struct ring_buffer_per_cpu *cpu_buffer = filp->private_data;
> > +
> > + if (!trace_buffer || !tracedir)
> > + return -EINVAL;
> > + switch (cmd) {
> > + case KTRACE_BUF_GET_SIZE:
> > + {
> > + unsigned long bufsize;
> > + bufsize = ring_buffer_size(trace_buffer);
> > + return put_user((u32)bufsize, argp);
> > + }
> > + case KTRACE_BUF_GET_PRODUCED:
> > + return put_user(ring_buffer_get_produced(cpu_buffer), argp);
> > + case KTRACE_BUF_GET_CONSUMED:
> > + return put_user(ring_buffer_get_consumed(cpu_buffer), argp);
> > + case KTRACE_BUF_PUT_CONSUMED:
> > + {
> > + u32 consumed, consumed_old;
> > + int ret;
> > +
> > + ret = get_user(consumed, argp);
> > + if (ret) {
> > + printk(KERN_WARNING
> > + "error getting consumed value: %d\n", ret);
> > + return ret;
> > + }
> > + consumed_old = ring_buffer_get_consumed(cpu_buffer);
> > + if (consumed == consumed_old)
> > + return 0;
> > + ring_buffer_advance_reader(cpu_buffer, consumed - consumed_old);
> > + return 0;
> > + }
> > + default:
> > + return -ENOIOCTLCMD;
> > + }
> > + return 0;
> > +}
> > +
> > +static const struct file_operations tracedata_file_operations = {
> > + .open = trace_data_open,
> > + .poll = trace_data_poll,
> > + .mmap = trace_data_mmap,
> > + .ioctl = trace_data_ioctl,
> > + .release = trace_data_release,
> > +};
> > +
> > +/* function to report the number of lost events due to buffer overflow */
> > +static ssize_t trace_overflow_read(struct file *filp, char __user *ubuf,
> > + size_t cnt, loff_t *ppos)
> > +{
> > + int ret;
> > + char buf[64];
> > + ret = snprintf(buf, 64, "%lu\n", ring_buffer_overruns(trace_buffer));
> > + return simple_read_from_buffer(ubuf, cnt, ppos, buf, ret);
> > +}
> > +
> > +static const struct file_operations traceoverflow_file_operations = {
> > + .read = trace_overflow_read,
> > +};
> > +
> > +/* function to re-initialize trace buffer */
> > +static void __trace_buffer_reset(struct file *filp,
> > + unsigned long val, size_t *cnt)
> > +{
> > + if (val && trace_buffer)
> > + ring_buffer_reset(trace_buffer);
> > +}
> > +
> > +static ssize_t trace_buffer_reset(struct file *filp, const char __user *ubuf,
> > + size_t cnt, loff_t *ppos)
> > +{
> > + return trace_debugfs_write(filp, ubuf, cnt, ppos,
> > + __trace_buffer_reset);
> > +}
> > +
> > +static const struct file_operations tracereset_file_operations = {
> > + .write = trace_buffer_reset,
> > +};
> > +
> > +/*
> > + * We use debugfs for kernel-user communication. All of the control/info
> > + * files are under debugfs/ktrace directory.
> > + */
> > +static int create_debugfs(void)
> > +{
> > + struct dentry *entry, *bufdir;
> > + struct ring_buffer_per_cpu *cpu_buffer;
> > + char *tmpname;
> > + int cpu;
> > +
> > + tracedir = debugfs_create_dir("ktrace", NULL);
> > + if (!tracedir) {
> > + printk(KERN_WARNING "Couldn't create debugfs directory\n");
> > + return -ENOMEM;
> > + }
> > +
> > + /* 'buffers' is the directory that holds trace buffer debugfs files */
> > + bufdir = debugfs_create_dir("buffers", tracedir);
> > + if (!bufdir) {
> > + printk(KERN_WARNING "Couldn't create 'buffers' directory\n");
> > + return -ENOMEM;
> > + }
> > +
> > + tmpname = kzalloc(NAME_MAX + 1, GFP_KERNEL);
> > + if (!tmpname)
> > + return -ENOMEM;
> > + /* create a debugfs file for each cpu buffer */
> > + for_each_possible_cpu(cpu) {
> > + snprintf(tmpname, NAME_MAX, "%s%d", "cpu", cpu);
> > + cpu_buffer = ring_buffer_cpu(trace_buffer, cpu);
> > + entry = debugfs_create_file(tmpname, 0440, bufdir, cpu_buffer,
> > + &tracedata_file_operations);
> > + if (!entry) {
> > + printk(KERN_WARNING
> > + "Couldn't create debugfs \'%s\' entry\n",
> > + tmpname);
> > + return -ENOMEM;
> > + }
> > + }
> > + kfree(tmpname);
> > +
> > + /* the control file for users to enable/disable global tracing. */
> > + entry = debugfs_create_file("enabled", 0664, tracedir, NULL,
> > + &tracecontrol_file_operations);
> > + if (!entry) {
> > + printk(KERN_WARNING
> > + "Couldn't create debugfs 'enabled' entry\n");
> > + return -ENOMEM;
> > + }
> > +
> > + /*
> > + * the debugfs file that displays the name, format etc. of every
> > + * supported trace event. The file is to be used by the user-space
> > + * trace parser to analyze the collected trace data.
> > + */
> > + entry = debugfs_create_file("info", 0444, tracedir, NULL,
> > + &traceinfo_file_operations);
> > + if (!entry) {
> > + printk(KERN_WARNING "Couldn't create debugfs 'info' entry\n");
> > + return -ENOMEM;
> > + }
> > +
> > + /*
> > + * the debugfs file that reports the number of events
> > + * lost due to buffer overflow
> > + */
> > + entry = debugfs_create_file("overflow", 0444, tracedir, NULL,
> > + &traceoverflow_file_operations);
> > + if (!entry) {
> > + printk(KERN_WARNING
> > + "Couldn't create debugfs 'overflow' entry\n");
> > + return -ENOMEM;
> > + }
> > +
> > + /* the debugfs file that resets trace buffer upon write */
> > + entry = debugfs_create_file("reset", 0220, tracedir, NULL,
> > + &tracereset_file_operations);
> > + if (!entry) {
> > + printk(KERN_WARNING "Couldn't create debugfs 'reset' entry\n");
> > + return -ENOMEM;
> > + }
> > +
> > + /* create the directory that holds the control files for every event */
> > + eventdir = debugfs_create_dir("events", tracedir);
> > + if (!eventdir) {
> > + printk(KERN_WARNING "Couldn't create 'events' directory\n");
> > + return -ENOMEM;
> > + }
> > + return 0;
> > +}
> > +
> > +static void remove_debugfs_file(struct dentry *dir, char *name)
> > +{
> > + struct dentry *dentry;
> > + dentry = lookup_one_len(name, dir, strlen(name));
> > + if (dentry && !IS_ERR(dentry)) {
> > + debugfs_remove(dentry);
> > + dput(dentry);
> > + }
> > +}
> > +
> > +static int remove_debugfs(void)
> > +{
> > + struct dentry *bufdir;
> > + char *tmpname;
> > + int cpu;
> > +
> > + bufdir = lookup_one_len("buffers", tracedir, strlen("buffers"));
> > + if (!bufdir || IS_ERR(bufdir))
> > + return -EIO;
> > + tmpname = kzalloc(NAME_MAX + 1, GFP_KERNEL);
> > + if (!tmpname)
> > + return -ENOMEM;
> > + for_each_possible_cpu(cpu) {
> > + snprintf(tmpname, NAME_MAX, "%s%d", "cpu", cpu);
> > + remove_debugfs_file(bufdir, tmpname);
> > + }
> > + kfree(tmpname);
> > + debugfs_remove(bufdir);
> > +
> > + remove_debugfs_file(tracedir, "enabled");
> > + remove_debugfs_file(tracedir, "info");
> > + debugfs_remove(eventdir);
> > + debugfs_remove(tracedir);
> > + return 0;
> > +}
> > +
> > +static int __init init_ktrace(void)
> > +{
> > + struct kernel_trace *iter;
> > + int i, err;
> > +
> > + ktrace_next_id = 0;
> > + kref_set(&ktrace_kref, 1);
> > + trace_buffer = ring_buffer_alloc(trace_buf_size, RB_FL_OVERWRITE);
> > + if (trace_buffer == NULL)
> > + return -ENOMEM;
> > + setup_timer(&trace_timer, wakeup_readers, 0);
> > + for (i = 0; i < KTRACE_HASH_SIZE; i++)
> > + INIT_HLIST_HEAD(&ktrace_table[i]);
> > +
> > + mutex_lock(&ktrace_mutex);
> > + err = create_debugfs();
> > + if (err != 0)
> > + goto out;
> > + for (iter = __start___ktraces; iter < __stop___ktraces; iter++)
> > + add_ktrace(iter);
> > + err = register_module_notifier(&ktrace_module_nb);
> > +out:
> > + mutex_unlock(&ktrace_mutex);
> > + return err;
> > +}
> > +module_init(init_ktrace);
> > +
> > +static void __exit exit_ktrace(void)
> > +{
> > + struct kernel_trace *iter;
> > +
> > + if (ktrace_enabled) {
> > + ktrace_enabled = 0;
> > + del_timer_sync(&trace_timer);
> > + }
> > + mutex_lock(&ktrace_mutex);
> > + for (iter = __start___ktraces; iter < __stop___ktraces; iter++)
> > + remove_ktrace(iter, 1);
> > + unregister_module_notifier(&ktrace_module_nb);
> > + remove_debugfs();
> > + mutex_unlock(&ktrace_mutex);
> > +
> > + kref_put(&ktrace_kref, release_trace_buffer);
> > +}
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
More information about the lttng-dev
mailing list