
Author: admin
Published: January 12, 2026
Category: News, Ops Stories

# Apple Silicon PMU Counters: Analysis and Implementation Guide

## Overview

This document analyzes the Performance Monitoring Unit (PMU) counter system on Apple Silicon processors (M1, M2, and later), based on research into Apple's private `kperf` API. The analysis covers the hardware architecture, counter limitations, compatibility rules, and practical implementation considerations.

## Problem Statement

Apple Silicon processors provide PMU counters for tracking microarchitectural events, but Apple does not publicly document:
- The maximum number of counters that can be monitored simultaneously
- Why certain counters are incompatible with each other
- The algorithm for counter allocation
- How counter ordering affects compatibility

This lack of documentation forces developers to rely on trial-and-error or reverse-engineering to use PMU counters effectively.

## System Analysis

### Fundamental Components

```mermaid
graph TD
    subgraph Hardware
        A[Fixed Counters]
        B[Programmable Counters]
    end

    subgraph Software
        C[kperf Framework]
        D[Counter Database]
        E[Allocation Algorithm]
    end

    subgraph User Space
        F[Instruments App]
        G[Custom Tools]
    end

    A --> C
    B --> C
    D --> E
    C --> E
    E --> F
    E --> G
```

### PMU Counter Architecture

#### Fixed Counters (2)
Apple Silicon provides two fixed counters that are always available:
- **Cycles** (`FIXED_CYCLES`): Mask `0b0000000001`
- **Instructions** (`FIXED_INSTRUCTIONS`): Mask `0b0000000010`

These counters have unique bit masks and are compatible with any other counter.

#### Programmable Counters (8)
The remaining 8 slots are shared among 58 programmable events. Each event carries a 10-bit mask in which bit n (counting from bit 0 on the right) is set if the event may be placed in counter slot n.

### Counter Categories by Mask

| Mask Type | Bit Pattern (bit 9 to bit 0) | Events | Allocation Rule |
|-----------|------------------------------|--------|-----------------|
| Fixed Cycle | `0000000001` | 1 | Unique slot (bit 0) |
| Fixed Instruction | `0000000010` | 1 | Unique slot (bit 1) |
| Group M | `0010000000` | 6 | Single slot (bit 7) |
| Group G | `0011100000` | 18 | 3 slots (bits 5-7) |
| General | `1111111100` | 33 | 8 slots (bits 2-9) |
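
To make the bit patterns concrete, here is a minimal sketch that encodes the masks as constants and derives two properties from them: how many slots a mask permits and which slot is the lowest one it permits. The constant and function names (`MASK_GROUP_M`, `slot_count`, `lowest_slot`) are illustrative and not part of `kperf`.

```c
#include <assert.h>
#include <stdint.h>

/* Mask values copied from the bit patterns in the table (bit 0 = slot 0). */
enum {
    MASK_FIXED_CYCLES       = 0x001, /* 0b0000000001 */
    MASK_FIXED_INSTRUCTIONS = 0x002, /* 0b0000000010 */
    MASK_GROUP_M            = 0x080, /* 0b0010000000 */
    MASK_GROUP_G            = 0x0E0, /* 0b0011100000 */
    MASK_GENERAL            = 0x3FC  /* 0b1111111100 */
};

/* Number of slots a counter with this mask is allowed to use. */
int slot_count(uint32_t mask)
{
    int n = 0;
    assert(mask < (1u << 10)); /* only 10 slots exist */
    for (; mask != 0; mask &= mask - 1) /* clear the lowest set bit */
        n++;
    return n;
}

/* Index of the lowest slot the mask allows, or -1 for an empty mask. */
int lowest_slot(uint32_t mask)
{
    for (int bit = 0; bit < 10; bit++)
        if (mask & (1u << bit))
            return bit;
    return -1;
}
```

Called from your own `main`, `slot_count(MASK_GROUP_G)` yields 3 and `lowest_slot(MASK_GROUP_M)` yields 7, matching the bit positions in the patterns.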

### Group M Counters (6 counters - incompatible in pairs)

```
INST_ALL
INST_INT_ALU
INST_INT_ST
INST_LDST
INST_SIMD_ALU
RETIRE_UOP
```

### Group G Counters (18 counters - incompatible in quadruples)

```
BRANCH_CALL_INDIR_MISPRED_NONSPEC
BRANCH_COND_MISPRED_NONSPEC
BRANCH_INDIR_MISPRED_NONSPEC
BRANCH_MISPRED_NONSPEC
BRANCH_RET_INDIR_MISPRED_NONSPEC
INST_BARRIER
INST_BRANCH
INST_BRANCH_CALL
INST_BRANCH_COND
INST_BRANCH_INDIR
INST_BRANCH_RET
INST_BRANCH_TAKEN
INST_INT_LD
INST_SIMD_LD
INST_SIMD_ST
L1D_CACHE_MISS_LD_NONSPEC
L1D_CACHE_MISS_ST_NONSPEC
L1D_TLB_MISS_NONSPEC
```
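
The "incompatible in pairs" and "incompatible in quadruples" rules follow from slot capacity: Group M events compete for one slot and Group G events for three. The sketch below is an order-independent feasibility check under the assumption that every programmable event uses one of the three masks listed earlier. Because those masks are nested (the Group M slot lies inside the Group G range, which lies inside the general range), a valid placement exists exactly when no nesting level is oversubscribed. The function name `masks_can_coexist` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MASK_GROUP_M 0x080u /* 0b0010000000: 1 slot  */
#define MASK_GROUP_G 0x0E0u /* 0b0011100000: 3 slots */
#define MASK_GENERAL 0x3FCu /* 0b1111111100: 8 slots */

/*
 * Returns true if some slot assignment exists for the given programmable
 * counters, regardless of the order in which they are added.
 * Assumption: each mask is one of the three values defined above.
 */
bool masks_can_coexist(const uint32_t *masks, size_t n)
{
    size_t m = 0, g = 0, general = 0;
    assert(masks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        switch (masks[i]) {
        case MASK_GROUP_M: m++; break;
        case MASK_GROUP_G: g++; break;
        case MASK_GENERAL: general++; break;
        default: return false; /* mask not covered by this sketch */
        }
    }
    /* Group M shares its slot with Group G, and both share the general range. */
    return m <= 1 && m + g <= 3 && m + g + general <= 8;
}
```

Under this check, two Group M events or four Group G events are rejected, and so is one Group M event combined with three Group G events, since all four would have to fit into slots 5 to 7.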

## Counter Allocation Algorithm

### Core Principle

When adding a counter to the monitoring list:
> The counter takes the first free slot permitted by its mask, scanning from the lowest bit upward.

### Why Order Matters

The allocation algorithm processes counters sequentially and never moves a counter once it is placed. A counter with a wide mask can therefore take a slot that a later counter with a narrower mask needs as its only option.

#### Example: Ordering Failure Case

```mermaid
graph LR
    subgraph Initial State
        B0["Slot 0: Empty"]
        B1["Slot 1: Empty"]
        B2["Slot 2: Empty"]
        B3["Slot 3: Empty"]
        B4["Slot 4: Empty"]
        B5["Slot 5: Empty"]
        B6["Slot 6: Empty"]
        B7["Slot 7: Empty"]
        B8["Slot 8: Empty"]
        B9["Slot 9: Empty"]
    end
```

Adding counters in this order fails:
1. `L1D_TLB_ACCESS` (mask `1111111100`) - occupies slot 2
2. `L1D_TLB_MISS` (mask `1111111100`) - occupies slot 3
3. `L1D_CACHE_MISS_ST` (mask `1111111100`) - occupies slot 4
4. `L1D_CACHE_MISS_LD` (mask `1111111100`) - occupies slot 5
5. `LD_UNIT_UOP` (mask `1111111100`) - occupies slot 6
6. `ST_UNIT_UOP` (mask `1111111100`) - occupies slot 7
7. `INST_LDST` (mask `0010000000`) - needs slot 7, which `ST_UNIT_UOP` already occupies

#### Solution: Reorder Counters

Swap `ST_UNIT_UOP` and `INST_LDST`, keeping steps 1 to 5 unchanged:

6. `INST_LDST` (mask `0010000000`) - occupies slot 7
7. `ST_UNIT_UOP` (mask `1111111100`) - occupies slot 8 (skips occupied slot 7)
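
Both orderings can be replayed with a small simulation. The sketch below assumes the first-fit rule from the Core Principle section and uses only the two mask values involved in this example; `allocate_in_order`, `failing_order`, and `working_order` are illustrative names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MASK_GROUP_M 0x080u /* INST_LDST */
#define MASK_GENERAL 0x3FCu /* the six general events in the example */

/*
 * Places masks[0..n-1] in order using a first-fit scan and records the
 * chosen slot for each in slots[]. Returns n on success, or the index of
 * the first counter that could not be placed.
 */
size_t allocate_in_order(const uint32_t *masks, size_t n, int *slots)
{
    uint32_t used = 0;
    assert(masks != NULL && slots != NULL);
    for (size_t i = 0; i < n; i++) {
        uint32_t free_bits = masks[i] & ~used;
        int slot = 0;
        if (free_bits == 0)
            return i; /* every permitted slot is already taken */
        while (!(free_bits & (1u << slot)))
            slot++;   /* find the lowest free permitted slot */
        slots[i] = slot;
        used |= 1u << slot;
    }
    return n;
}

/* Failing order: six general counters, then INST_LDST. */
const uint32_t failing_order[7] = {
    MASK_GENERAL, MASK_GENERAL, MASK_GENERAL, MASK_GENERAL,
    MASK_GENERAL, MASK_GENERAL, MASK_GROUP_M
};

/* Working order: INST_LDST moved ahead of ST_UNIT_UOP. */
const uint32_t working_order[7] = {
    MASK_GENERAL, MASK_GENERAL, MASK_GENERAL, MASK_GENERAL,
    MASK_GENERAL, MASK_GROUP_M, MASK_GENERAL
};
```

With a seven-element `slots` array, `allocate_in_order(failing_order, 7, slots)` stops at index 6 (the seventh counter), while the working order places all seven, with `INST_LDST` in slot 7 and `ST_UNIT_UOP` in slot 8.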

### Recommended Ordering Strategy

For predictable behavior, add counters in ascending order of mask value, so that the most constrained counters claim their slots before wider masks can take them:
1. Fixed counters first (`FIXED_CYCLES`, `FIXED_INSTRUCTIONS`)
2. Group M counters (single-slot, mask `0010000000`)
3. Group G counters (three-slot, mask `0011100000`)
4. General counters (wide mask, `1111111100`)

```mermaid
graph TD
    A[Start] --> B{Add Fixed Counters}
    B --> C{Add Group M Counters}
    C --> D{Add Group G Counters}
    D --> E{Add General Counters}
    E --> F[Complete]

    B -->|Error| G[Allocation Failed]
    C -->|Error| G
    D -->|Error| G
    E -->|Error| G
```
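
One way to apply this ordering in a tool is to sort the requested events by mask before handing them to the allocator. The sketch below uses the standard `qsort`; the `counter_request` struct and `order_for_allocation` function are hypothetical and only pair an event name with its mask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical pairing of an event name with its slot mask. */
struct counter_request {
    const char *name;
    uint32_t mask;
};

/* qsort comparator: ascending by numeric mask value. */
static int compare_by_mask(const void *a, const void *b)
{
    const struct counter_request *x = a;
    const struct counter_request *y = b;
    return (x->mask > y->mask) - (x->mask < y->mask);
}

/* Reorders requests so fixed, Group M, Group G, and general events follow
 * each other in that order (their mask values are 1, 2, 128, 224, 1020). */
void order_for_allocation(struct counter_request *reqs, size_t n)
{
    assert(reqs != NULL);
    qsort(reqs, n, sizeof reqs[0], compare_by_mask);
}
```

`qsort` is not guaranteed to be stable, so events that share a mask may change their relative order. That does not affect slot feasibility in this model, because counters with identical masks are interchangeable.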

## Implementation Considerations

### kperf API Structures

The `kpep_event` structure contains the critical `mask` field:

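```c
#include <stdint.h>

// Short integer aliases used by the struct below
typedef uint8_t  u8;
typedef uint32_t u32;

typedef struct kpep_event {
    const char *name;
    const char *description;
    const char *errata;
    const char *alias;        // e.g., "Instructions", "Cycles"
    const char *fallback;
    u32 mask;                 // Allowed counter slots, one bit per slot; critical for compatibility
    u8 number;
    u8 umask;
    u8 reserved;
    u8 is_fixed;
} kpep_event;
```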

### Key Constraints

| Constraint | Value | Notes |
|------------|-------|-------|
| Maximum counters | 10 | Based on 10-bit mask width |
| Fixed counters | 2 | Always available |
| Programmable slots | 8 | Shared among 58 events |
| Privileges required | sudo | kperf requires root access |

## Practical Tools

### Lauka

A custom tool created as a result of this research:
- Forked from the `poop` tool by Andrew Kelley
- Incorporates `kperf` reverse-engineering by ibireme
- Apple Silicon only (M1, M2, and later)
- Features:
  - Select events to monitor
  - Display all available events
  - Warm-up runs before measurement
  - Proper counter ordering

### Example Output

```
measurement                 mean ± σ          min … max
wall_time                   591ms ± 7.6ms     583ms … 605ms
peak_rss                    137MB ± 0.3MB     136.6MB … 137.4MB
core_active_cycle           2.51G ± 22.1M     2.48G … 2.54G
inst_all                    3.62G ± 23.9M     2.53G … 3.69G
l1d_cache_miss_ld_nonspec   3.58M ± 31.7K     3.54M … 3.63M
branch_mispred_nonspec      21.4M ± 58.2K     21.3M … 21.5M
```

## Lessons Learned

1. **Research cross-platform**: Linux PMU implementations are better documented and can provide insights applicable to Apple Silicon.

2. **Study reverse-engineered code deeply**: Early thorough analysis of the `kperf` structures would have revealed the mask-based allocation system immediately.

3. **Focus on root causes**: Spending time on combinatorial analysis (18+ million incompatible cases) was less productive than understanding the underlying allocation algorithm.

4. **Order matters everywhere**: Counter ordering affects compatibility even in Apple's own Instruments application.

## References

- Original article: https://blog.bugsiki.dev/posts/apple-pmu/
- Apple CPU Optimization Guide (requires Apple Developer account)
- kperf reverse-engineering: ibireme's work
- poop tool: Andrew Kelley
- Lauka tool: https://github.com/ (link to be added)

Last modified: January 12, 2026

