How to Use the perf Performance Analysis Tool

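perf is a powerful tool: it can sample a wide range of hardware and software events, and it can also collect information from tracepoints (such as system calls, TCP/IP events, and file system operations). Its source code is maintained together with the Linux kernel source, which makes it a kernel-level tool. perf is the first-choice tool for profiling on Linux.
perf commands
The perf tool provides a rich set of commands for collecting and analyzing performance and trace data. The commands that perf supports are listed below: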
usage: perf [--version] [--help] [options] command [args]
the most commonly used perf commands are:
annotate read perf.data (created by perf record) and display annotated code
archive create archive with object files with build-ids found in perf.data file
bench general framework for benchmark suites
buildid-cache manage build-id cache.
buildid-list list the buildids in a perf.data file
c2c shared data c2c/hitm analyzer.
config get and set variables in a configuration file.
data data file related processing
diff read perf.data files and display the differential profile
evlist list the event names in a perf.data file
ftrace simple wrapper for kernel's ftrace functionality
inject filter to augment the events stream with additional information
kallsyms searches running kernel for symbols
kmem tool to trace/measure kernel memory properties
kvm tool to trace/measure kvm guest os
list list all symbolic event types
lock analyze lock events
mem profile memory accesses
record run a command and record its profile into perf.data
report read perf.data (created by perf record) and display the profile
sched tool to trace/measure scheduler properties (latencies)
script read perf.data (created by perf record) and display trace output
stat run a command and gather performance counter statistics
test runs sanity tests.
timechart tool to visualize total system behavior during a workload
top system profiling tool.
version display the version of perf binary
probe define new dynamic tracepoints
trace strace inspired tool
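annotate: reads perf.data (created by perf record) and displays annotated code; the application must be compiled with the -g option.
archive: creates an archive of the object files whose build-ids are found in the perf.data file.
bench: runs benchmarks that stress the scheduler, memory access, epoll, futex, and so on.
buildid-cache: manages the build-id cache.
buildid-list: lists the build-ids in a perf.data file.
c2c: shared-data C2C/HITM analyzer.
config: reads or sets variables in a configuration file.
data: data file related processing.
diff: reads perf.data files and displays the differential profile.
ftrace: a simple wrapper around the kernel's ftrace functionality.
inject: a filter that augments the event stream with additional information.
kallsyms: searches the running kernel for symbols.
kmem: a tool to trace/measure kernel memory properties.
kvm: a tool to trace/measure KVM guest operating systems.
list: lists all symbolic event types.
lock: analyzes lock events.
mem: profiles memory accesses.
record: records profiling data into perf.data.
report: reads perf.data (created by perf record) and displays the profile.
sched: a tool to trace/measure scheduler properties (latencies).
script: reads perf.data (created by perf record) and displays trace output.
stat: runs a command and gathers performance counter statistics.
test: runs sanity tests of the features supported by the kernel.
timechart: a tool to visualize total system behavior during a workload.
top: system profiling tool.
probe: defines new dynamic tracepoints.
trace: a tool inspired by strace.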
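Test program
The test program spins in an infinite loop inside print(), incrementing a counter on every iteration. We compile it into an executable with gcc test.c -g -o test. The sections below use this program to demonstrate the perf tools.
#include <stdio.h>
/* Spin forever so that nearly all CPU time is attributed to print(). */
void print(void)
{
    /* unsigned: wraparound is well defined; volatile: the loop is not optimized away */
    volatile unsigned int i = 0;
    while (1) {
        i++;
    }
}
int main(void)
{
    print();
    return 0;
}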
list
The list command enumerates all events that perf can monitor.
list of pre-defined events (to be used in -e):
branch-instructions or branches [hardware event]
branch-misses [hardware event]
bus-cycles [hardware event]
cache-misses [hardware event]
cache-references [hardware event]
cpu-cycles or cycles [hardware event]
instructions [hardware event]
alignment-faults [software event]
bpf-output [software event]
context-switches or cs [software event]
cpu-clock [software event]
cpu-migrations or migrations [software event]
dummy [software event]
emulation-faults [software event]
major-faults [software event]
minor-faults [software event]
page-faults or faults [software event]
task-clock [software event]
duration_time [tool event]
l1-dcache-load-misses [hardware cache event]
l1-dcache-loads [hardware cache event]
l1-icache-load-misses [hardware cache event]
l1-icache-loads [hardware cache event]
branch-load-misses [hardware cache event]
branch-loads [hardware cache event]
dtlb-load-misses [hardware cache event]
itlb-load-misses [hardware cache event]
br_immed_retired or armv8_pmuv3/br_immed_retired/ [kernel pmu event]
br_mis_pred or armv8_pmuv3/br_mis_pred/ [kernel pmu event]
br_pred or armv8_pmuv3/br_pred/ [kernel pmu event]
bus_access or armv8_pmuv3/bus_access/ [kernel pmu event]
bus_cycles or armv8_pmuv3/bus_cycles/ [kernel pmu event]
cid_write_retired or armv8_pmuv3/cid_write_retired/ [kernel pmu event]
cpu_cycles or armv8_pmuv3/cpu_cycles/ [kernel pmu event]
exc_return or armv8_pmuv3/exc_return/ [kernel pmu event]
exc_taken or armv8_pmuv3/exc_taken/ [kernel pmu event]
inst_retired or armv8_pmuv3/inst_retired/ [kernel pmu event]
l1d_cache or armv8_pmuv3/l1d_cache/ [kernel pmu event]
l1d_cache_refill or armv8_pmuv3/l1d_cache_refill/ [kernel pmu event]
l1d_cache_wb or armv8_pmuv3/l1d_cache_wb/ [kernel pmu event]
l1d_tlb_refill or armv8_pmuv3/l1d_tlb_refill/ [kernel pmu event]
l1i_cache or armv8_pmuv3/l1i_cache/ [kernel pmu event]
l1i_cache_refill or armv8_pmuv3/l1i_cache_refill/ [kernel pmu event]
l1i_tlb_refill or armv8_pmuv3/l1i_tlb_refill/ [kernel pmu event]
l2d_cache or armv8_pmuv3/l2d_cache/ [kernel pmu event]
l2d_cache_refill or armv8_pmuv3/l2d_cache_refill/ [kernel pmu event]
l2d_cache_wb or armv8_pmuv3/l2d_cache_wb/ [kernel pmu event]
ld_retired or armv8_pmuv3/ld_retired/ [kernel pmu event]
mem_access or armv8_pmuv3/mem_access/ [kernel pmu event]
memory_error or armv8_pmuv3/memory_error/ [kernel pmu event]
pc_write_retired or armv8_pmuv3/pc_write_retired/ [kernel pmu event]
st_retired or armv8_pmuv3/st_retired/ [kernel pmu event]
sw_incr or armv8_pmuv3/sw_incr/ [kernel pmu event]
unaligned_ldst_retired or armv8_pmuv3/unaligned_ldst_retired/ [kernel pmu event]
cs_etm// [kernel pmu event]
imx8_ddr0/activate/ [kernel pmu event]
imx8_ddr0/axid-read/ [kernel pmu event]
imx8_ddr0/axid-write/ [kernel pmu event]
imx8_ddr0/cycles/ [kernel pmu event]
imx8_ddr0/hp-read-credit-cnt/ [kernel pmu event]
imx8_ddr0/hp-read/ [kernel pmu event]
imx8_ddr0/hp-req-nocredit/ [kernel pmu event]
imx8_ddr0/hp-xact-credit/ [kernel pmu event]
imx8_ddr0/load-mode/ [kernel pmu event]
imx8_ddr0/lp-read-credit-cnt/ [kernel pmu event]
imx8_ddr0/lp-req-nocredit/ [kernel pmu event]
imx8_ddr0/lp-xact-credit/ [kernel pmu event]
imx8_ddr0/perf-mwr/ [kernel pmu event]
imx8_ddr0/precharge/ [kernel pmu event]
imx8_ddr0/raw-hazard/ [kernel pmu event]
imx8_ddr0/read-accesses/ [kernel pmu event]
imx8_ddr0/read-activate/ [kernel pmu event]
imx8_ddr0/read-command/ [kernel pmu event]
imx8_ddr0/read-cycles/ [kernel pmu event]
imx8_ddr0/read-modify-write-command/ [kernel pmu event]
imx8_ddr0/read-queue-depth/ [kernel pmu event]
imx8_ddr0/read-write-transition/ [kernel pmu event]
imx8_ddr0/read/ [kernel pmu event]
imx8_ddr0/refresh/ [kernel pmu event]
imx8_ddr0/selfresh/ [kernel pmu event]
imx8_ddr0/wr-xact-credit/ [kernel pmu event]
imx8_ddr0/write-accesses/ [kernel pmu event]
imx8_ddr0/write-command/ [kernel pmu event]
imx8_ddr0/write-credit-cnt/ [kernel pmu event]
imx8_ddr0/write-cycles/ [kernel pmu event]
imx8_ddr0/write-queue-depth/ [kernel pmu event]
imx8_ddr0/write/ [kernel pmu event]
branch:
br_cond
[conditional branch executed]
br_cond_mispred
[conditional branch mispredicted]
br_indirect_mispred
[indirect branch mispredicted]
br_indirect_mispred_addr
[indirect branch mispredicted because of address miscompare]
br_indirect_spec
[branch speculatively executed, indirect branch]
bus:
bus_access_rd
[bus access read]
bus_access_wr
[bus access write]
cache:
ext_snoop
[scu snooped data from another cpu for this cpu]
prefetch_linefill
[linefill because of prefetch]
prefetch_linefill_drop
[instruction cache throttle occurred]
read_alloc
[read allocate mode]
read_alloc_enter
[entering read allocate mode]
memory:
ext_mem_req
[external memory request]
ext_mem_req_nc
[non-cacheable external memory request]
other:
exc_fiq
[exception taken, fiq]
exc_irq
[exception taken, irq]
l1d_cache_err
[l1 data cache (data, tag or dirty) memory error, correctable or non-correctable]
l1i_cache_err
[l1 instruction cache (data or tag) memory error]
pre_decode_err
[pre-decode error]
tlb_err
[tlb memory error]
pipeline:
agu_dep_stall
[cycles there is an interlock for a load/store instruction waiting for data to calculate the address in the
agu]
decode_dep_stall
[cycles the dpu iq is empty and there is a pre-decode error being processed]
ic_dep_stall
[cycles the dpu iq is empty and there is an instruction cache miss being processed]
iutlb_dep_stall
[cycles the dpu iq is empty and there is an instruction micro-tlb miss being processed]
ld_dep_stall
[cycles there is a stall in the wr stage because of a load miss]
other_interlock_stall
[cycles there is an interlock other than advanced simd/floating-point instructions or load/store instruction]
other_iq_dep_stall
[cycles that the dpu iq is empty and that is not because of a recent micro-tlb miss, instruction cache miss or
pre-decode error]
simd_dep_stall
[cycles there is an interlock for an advanced simd/floating-point operation]
st_dep_stall
[cycles there is a stall in the wr stage because of a store]
stall_sb_full
[data write operation that stalls the pipeline because the store buffer is full]
rnnn [raw hardware event descriptor]
cpu/t1=v1[,t2=v2,t3 ...]/modifier [raw hardware event descriptor]
(see 'man perf-list' on how to encode it)
mem:[/len][:access] [hardware breakpoint]
metric groups:
no_group:
imx8mp_bandwidth_usage.lpddr4
[bandwidth usage for lpddr4 evk board. unit: imx8_ddr ]
imx8mp_ddr_read.2d
[bytes of gpu 2d read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.3d
[bytes of gpu 3d read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.a53
[bytes of a53 core read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.all
[bytes of all masters read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.audio_dsp
[bytes of audio dsp read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.audio_sdma2_burst
[bytes of audio sdma2_burst read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.audio_sdma2_per
[bytes of audio sdma2_per read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.audio_sdma3_burst
[bytes of audio sdma3_burst read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.audio_sdma3_per
[bytes of audio sdma3_per read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.audio_sdma_pif
[bytes of audio sdma_pif read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.dewarp
[bytes of display dewarp read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.hdmi_hdcp
[bytes of hdmi_tx tx_hdcp read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.hdmi_hrv_mwr
[bytes of hdmi_tx hrv_mwr read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.hdmi_lcdif
[bytes of hdmi_tx lcdif read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.isi1
[bytes of display isi1 read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.isi2
[bytes of display isi2 read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.isi3
[bytes of display isi3 read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.isp1
[bytes of display isp1 read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.isp2
[bytes of display isp2 read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.lcdif1
[bytes of display lcdif1 read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.lcdif2
[bytes of display lcdif2 read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.npu
[bytes of npu read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.pci
[bytes of hsio pci read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.supermix
[bytes of supermix(m7) core read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.usb1
[bytes of hsio usb1 read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.usb2
[bytes of hsio usb2 read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.vpu1
[bytes of vpu1 read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.vpu2
[bytes of vpu2 read from ddr. unit: imx8_ddr ]
imx8mp_ddr_read.vpu3
[bytes of vpu3 read from ddr. unit: imx8_ddr ]
imx8mp_ddr_write.2d
[bytes of gpu 2d write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.3d
[bytes of gpu 3d write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.a53
[bytes of a53 core write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.all
[bytes of all masters write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.audio_dsp
[bytes of audio dsp write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.audio_sdma2_burst
[bytes of audio sdma2_burst write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.audio_sdma2_per
[bytes of audio sdma2_per write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.audio_sdma3_burst
[bytes of audio sdma3_burst write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.audio_sdma3_per
[bytes of audio sdma3_per write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.audio_sdma_pif
[bytes of audio sdma_pif write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.dewarp
[bytes of display dewarp write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.hdmi_hdcp
[bytes of hdmi_tx tx_hdcp write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.hdmi_hrv_mwr
[bytes of hdmi_tx hrv_mwr write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.hdmi_lcdif
[bytes of hdmi_tx lcdif write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.isi1
[bytes of display isi1 write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.isi2
[bytes of display isi2 write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.isi3
[bytes of display isi3 write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.isp2
[bytes of display isp2 write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.lcdif1
[bytes of display lcdif1 write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.lcdif2
[bytes of display lcdif2 write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.npu
[bytes of npu write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.pci
[bytes of hsio pci write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.supermix
[bytes of supermix(m7) write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.usb1
[bytes of hsio usb1 write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.usb2
[bytes of hsio usb2 write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.vpu1
[bytes of vpu1 write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.vpu2
[bytes of vpu2 write to ddr. unit: imx8_ddr ]
imx8mp_ddr_write.vpu3
[bytes of vpu3 write to ddr. unit: imx8_ddr ]
imx8_ddr_ddr_mon:
imx8mp_ddr_write.isp1
[bytes of display isp1 write to ddr. unit: imx8_ddr ]
stat
We can use stat to measure a program's run time and CPU cost. The main options supported by perf stat are:
-a, --all-cpus system-wide collection from all cpus
-A, --no-aggr disable cpu count aggregation
-B, --big-num print large numbers with thousands' separators
-C, --cpu list of cpus to monitor in system-wide
-D, --delay ms to wait before starting measurement after program start (-1: start with events disabled)
-d, --detailed detailed run - start a lot of events
-e, --event event selector. use 'perf list' to list available events
-G, --cgroup monitor event in cgroup name only
-g, --group put the counters into a counter group
-I, --interval-print
print counts at regular interval in ms (overhead is possible for values
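record
The -F 999 option sets the sampling frequency to 999 Hz, that is, 999 samples per second. The -p option attaches to an existing process by its PID.
Test command:
perf record -F 999 -p 997
perf then stores the recorded data in perf.data.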
report
usage: perf report [<options>]
-b, --branch-stack use branch records for per branch histogram filling
-c, --comms
only consider symbols in these comms
-C, --cpu list of cpus to profile
-d, --dsos
only consider symbols in these dsos
-D, --dump-raw-trace dump raw trace in ascii
-F, --fields
output field(s): overhead period sample overhead overhead_sys
overhead_us overhead_guest_sys overhead_guest_us overhead_children
sample period pid comm dso symbol parent cpu socket
srcline srcfile local_weight weight transaction trace
symbol_size dso_size cgroup cgroup_id ipc_null time
dso_from dso_to symbol_from symbol_to mispredict abort
in_tx cycles srcline_from srcline_to ipc_lbr symbol_daddr
dso_daddr locked tlb mem snoop dcacheline symbol_iaddr
phys_daddr
-f, --force don't complain, do it
-g, --call-graph
display call graph (stack chain/backtrace):
print_type: call graph printing style (graph|flat|fractal|folded|none)
threshold: minimum call graph inclusion threshold (<percent>)
print_limit: maximum number of call graph entry (<number>)
order: call graph order (caller|callee)
sort_key: call graph sort key (function|address)
branch: include last branch info to call graph (branch)
value: call graph value (percent|period|count)
default: graph,0.5,caller,function,percent
-G, --inverted alias for inverted call graph
-i, --input input file name
-I, --show-info display extended information about perf.data file
-k, --vmlinux vmlinux pathname
-M, --disassembler-style
specify disassembler style (e.g. -M intel for intel syntax)
-m, --modules load module symbols - warning: use only with -k and live kernel
-n, --show-nr-samples
show a column with the number of samples
-p, --parent regex filter to identify parent, see: '--sort parent'
-q, --quiet do not show any message
-s, --sort
sort by key(s): overhead overhead_sys overhead_us overhead_guest_sys
overhead_guest_us overhead_children sample period
pid comm dso symbol parent cpu socket srcline srcfile
local_weight weight transaction trace symbol_size
dso_size cgroup cgroup_id ipc_null time dso_from dso_to
symbol_from symbol_to mispredict abort in_tx cycles
srcline_from srcline_to ipc_lbr symbol_daddr dso_daddr
locked tlb mem snoop dcacheline symbol_iaddr phys_daddr
-S, --symbols
only consider these symbols
-t, --field-separator
separator for columns, no spaces will be added between columns '.' is reserved.
-T, --threads show per-thread event counters
-U, --hide-unresolved
only display entries resolved to a symbol
-v, --verbose be more verbose (show symbol address, etc)
-w, --column-widths
don't try to adjust column width, use these fixed values
-x, --exclude-other only display entries with parent-match
--asm-raw display raw encoding of assembly instructions (default)
--branch-history add last branch records to call history
--children accumulate callchains of children and show total overhead as well. enabled by default, use --no-children to disable.
--demangle disable symbol demangling
--demangle-kernel
enable kernel symbol demangling
--full-source-path
show full source file name path for source lines
--group show event group information together
--group-sort-idx
sort the output by the event at the index n in group. if n is invalid, sort by the first event. warning: should be used on grouped events.
--gtk use the gtk2 interface
--header show data header.
--header-only show only data header.
--hierarchy show entries in a hierarchy
--ignore-callees
ignore callees of these functions in call graphs
--ignore-vmlinux don't load vmlinux even if found
--inline show inline function
--itrace[=<opts>]
instruction tracing options
i[period]: synthesize instructions events
b: synthesize branches events (branch misses for arm spe)
c: synthesize branches events (calls only)
r: synthesize branches events (returns only)
x: synthesize transactions events
w: synthesize ptwrite events
p: synthesize power events
o: synthesize other events recorded due to the use
of aux-output (refer to perf record)
e[flags]: synthesize error events
each flag must be preceded by + or -
error flags are: o (overflow)
l (data lost)
d[flags]: create a debug log
each flag must be preceded by + or -
log flags are: a (all perf events)
f: synthesize first level cache events
m: synthesize last level cache events
t: synthesize tlb events
a: synthesize remote access events
g[len]: synthesize a call chain (use with i or x)
G[len]: synthesize a call chain on existing event records
l[len]: synthesize last branch entries (use with i or x)
L[len]: synthesize last branch entries on existing event records
sNUMBER: skip initial number of events
q: quicker (less detailed) decoding
PERIOD[ns|us|ms|i|t]: specify period to sample stream
concatenate multiple options. default is ibxwpe or cewp
--kallsyms
kallsyms pathname
--max-stack set the maximum stack depth when parsing the callchain, anything beyond the specified depth will be ignored. default: kernel.perf_event_max_stack or 127
--mem-mode mem access profile
--mmaps display recorded tasks memory maps
--ns show times in nanosecs
--objdump objdump binary to use for disassembly and annotations
--percent-limit
don't show entries under that percent
--percent-type
set percent type local/global-period/hits
--percentage
how to display percentage of filtered entries
--pid
only consider symbols in these pids
--prefix
add prefix to source file path names in programs (with --prefix-strip)
--prefix-strip
strip first n entries of source file path name in programs (with --prefix)
--pretty pretty printing style key: normal raw
--raw-trace show raw trace event output (do not use print fmt or plugins)
--samples number of samples to save per histogram entry for individual browsing
--show-cpu-utilization
show sample percentage for different cpu modes
--show-on-off-events
show the on/off switch events, used with --switch-on and --switch-off
--show-ref-call-graph
show callgraph from reference event
--show-total-period
show a column with the sum of periods
--socket-filter
only show processor socket that match with this filter
--source interleave source code with assembly code (default)
--stats display event stats
--stdio use the stdio interface
--stdio-color
'always' (default), 'never' or 'auto' only applicable to --stdio mode
--stitch-lbr enable lbr callgraph stitching approach
--switch-off
stop considering events after the ocurrence of this event
--switch-on
consider events after the ocurrence of this event
--symbol-filter
only show symbols that (partially) match with this filter
--symfs
look for files with symbols relative to this directory
--tasks display recorded tasks
--tid
only consider symbols in these tids
--time time span of interest (start,stop)
--time-quantum
set time quantum for time sort key (default 100ms)
--total-cycles sort all blocks by 'sampled cycles%'
--tui use the tui interface
Once the data has been collected, we can use the perf report command to look for performance bottlenecks in the samples.
perf report
Samples: 21K of event 'cycles', Event count (approx.): 38100133435
Overhead Command Shared Object Symbol
99.99% test test [.] print
0.00% test [kernel.kallsyms] [k] update_sd_lb_stats.constprop.0
0.00% test [kernel.kallsyms] [k] _raw_spin_unlock_irq
0.00% test [kernel.kallsyms] [k] shift_arg_pages
0.00% perf [kernel.kallsyms] [k] perf_event_exec
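Overhead: the percentage of all samples that fall in this symbol. In this scenario it is the share of total CPU time consumed by the symbol.
Command: the process name.
Shared Object: the module name, that is, which shared library or executable the symbol belongs to.
Symbol: the symbol name within the binary module. For a program written in a high-level language such as C, this is equivalent to the function name.
Locating the function is still not enough. perf can narrow the search to a finer granularity, so we do not have to guess which part of the function is at fault. Move the cursor to the print function with the up and down arrow keys and press Enter, and perf offers a menu of options for analyzing the function further.
Select the first option, "annotate print", and press Enter to analyze the function in more detail.
annotate print --- analyze the performance of the instructions or source lines in print
zoom into test thread --- focus on the thread test
zoom into test dso --- focus on the dynamic shared object test
browse map details --- view the map details
run scripts for samples of thread [test] --- run scripts on the samples of the test thread
run scripts for samples of symbol [test] --- run scripts on the samples of the function
run scripts for all samples --- run scripts on all samples
switch to another data file in pwd --- switch to another data file in the current directory
exit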
annotate
usage: perf annotate [<options>]
-C, --cpu list of cpus to profile
-d, --dsos
only consider symbols in these dsos
-D, --dump-raw-trace dump raw trace in ascii
-f, --force don't complain, do it
-i, --input input file name
-k, --vmlinux vmlinux pathname
-l, --print-line print matching source lines (may be slow)
-M, --disassembler-style
specify disassembler style (e.g. -M intel for intel syntax)
-m, --modules load module symbols - warning: use only with -k and live kernel
-n, --show-nr-samples
show a column with the number of samples
-P, --full-paths don't shorten the displayed pathnames
-q, --quiet do now show any message
-s, --symbol
symbol to annotate
We can also use annotate to analyze the print function on its own; the result is the same as entering annotate from within report.
perf annotate -l -s print
top
usage: perf top [<options>]
-a, --all-cpus system-wide collection from all cpus
-b, --branch-any sample any taken branches
-c, --count event period to sample
-C, --cpu list of cpus to monitor
-d, --delay number of seconds to delay between refreshes
-D, --dump-symtab dump the symbol table used for profiling
-E, --entries display this many functions
-e, --event event selector. use 'perf list' to list available events
-f, --count-filter
only display functions with more events than this
-F, --freq
profile at this frequency
-g enables call-graph recording and display
-i, --no-inherit child tasks do not inherit counters
-j, --branch-filter
branch stack filter modes
-K, --hide_kernel_symbols
hide kernel symbols
-k, --vmlinux vmlinux pathname
-M, --disassembler-style
specify disassembler style (e.g. -M intel for intel syntax)
-m, --mmap-pages
number of mmap data pages
-n, --show-nr-samples
show a column with the number of samples
-p, --pid profile events on existing process id
-r, --realtime collect data with this rt sched_fifo priority
-s, --sort
sort by key(s): pid, comm, dso, symbol, parent, cpu, srcline, ... please refer the man page for the complete list.
-t, --tid profile events on existing thread id
-U, --hide_user_symbols
hide user symbols
-u, --uid user to profile
-v, --verbose be more verbose (show counter open errors, etc)
-w, --column-widths
don't try to adjust column width, use these fixed values
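The perf top command is somewhat similar to the Linux top command: it prints, in real time, the status and statistics of the sampled events on the system. perf top is mainly used to profile, in real time, how hot each function is with respect to a given performance event. The default event is cycles (CPU cycles), which shows the hotness of all user-space and kernel functions on the system.
perf top supports two output interfaces, TUI and TTY. The default is TUI, but because TUI depends on more environment and library support it often produces garbled output, so this article uses the TTY interface (--stdio) throughout.
Running perf top with no arguments monitors every process on the system. In most cases we care about only one process, or want to track down a performance problem in one thread; perf top supports both (-p / -t).
Sometimes we need to look inside a hot function, for example one such as dh_ssm_blkbuf_alloc, and examine its call stacks to find out where it is being called so frequently. In that case, run:
perf top -t 4010 --stdio -g -K
The -g option displays the call stacks of functions, and -K hides kernel symbols so that only user-space functions are shown.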
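bench
bench evaluates system performance. It supports benchmarks for the scheduler, system calls, memory, epoll, and more.
usage:
perf bench [<common options>] <collection> <benchmark> [<options>]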
# list of all available benchmark collections:
sched: scheduler and ipc benchmarks
syscall: system call benchmarks
mem: memory access benchmarks
futex: futex stressing benchmarks
epoll: epoll stressing benchmarks
internals: perf-internals benchmarks
all: all benchmarks
Running perf bench all executes every supported benchmark.
# running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run
total time: 0.900 [sec]
# running sched/pipe benchmark...
# executed 1000000 pipe operations between two processes
total time: 15.180 [sec]
15.180503 usecs/op
65873 ops/sec
# running syscall/basic benchmark...
# executed 10000000 getppid() calls
total time: 3.972 [sec]
0.397209 usecs/op
2517568 ops/sec
# running mem/memcpy benchmark...
# function 'default' (default memcpy() provided by glibc)
# copying 1mb bytes ...
1.698370 gb/sec
# running mem/memset benchmark...
# function 'default' (default memset() provided by glibc)
# copying 1mb bytes ...
12.207031 gb/sec
# running mem/find_bit benchmark...
100000 operations 1 bits set of 1 bits
average for_each_set_bit took: 4638.600 usec (+- 13.761 usec)
average test_bit loop took: 1894.200 usec (+- 2.672 usec)
