# Apple Silicon PMU Counters: Analysis and Implementation Guide

## Overview

This document analyzes the Performance Monitoring Unit (PMU) counter system on Apple Silicon processors (M1, M2, and later), based on research into Apple's private `kperf` API. The analysis covers the hardware architecture, counter limitations, compatibility rules, and practical implementation considerations.

## Problem Statement

Apple Silicon processors provide PMU counters for tracking microarchitectural events, but Apple does not publicly document:

- The maximum number of counters that can be monitored simultaneously
- Why certain counters are incompatible with each other
- The algorithm for counter allocation
- How counter ordering affects compatibility

This lack of documentation forces developers to rely on trial-and-error or reverse-engineering to use PMU counters effectively.

## System Analysis

### Fundamental Components

```mermaid
graph TD
    subgraph Hardware
        A[Fixed Counters]
        B[Programmable Counters]
    end
    subgraph Software
        C[kperf Framework]
        D[Counter Database]
        E[Allocation Algorithm]
    end
    subgraph User Space
        F[Instruments App]
        G[Custom Tools]
    end
    A --> C
    B --> C
    D --> E
    C --> E
    E --> F
    E --> G
```

### PMU Counter Architecture

#### Fixed Counters (2)

Apple Silicon provides two fixed counters that are always available:

- **Cycles** (`FIXED_CYCLES`): Mask `0b0000000001`
- **Instructions** (`FIXED_INSTRUCTIONS`): Mask `0b0000000010`

These counters have unique bit masks and are compatible with any other counter.

#### Programmable Counters (8)

The remaining 8 slots are shared among 58 programmable events. These are allocated using a 10-bit mask system where each bit represents a potential counter slot.
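The 10-bit mask model described above can be sketched in C as follows. This is a minimal illustration, not part of `kperf`: the names `SLOT_COUNT`, `MASK_PROGRAMMABLE`, `mask_allows_slot`, `mask_slot_count`, and `mask_lowest_slot` are hypothetical, and the programmable mask is assumed to cover bits 2-9, that is, the 8 slots left after the two fixed counters.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for illustration only; none of these come from kperf. */
enum { SLOT_COUNT = 10 };                /* width of the slot mask */

#define MASK_FIXED_CYCLES       0x001u   /* 0b0000000001: slot 0 */
#define MASK_FIXED_INSTRUCTIONS 0x002u   /* 0b0000000010: slot 1 */
#define MASK_PROGRAMMABLE       0x3FCu   /* 0b1111111100: slots 2-9 (assumed) */

/* Returns 1 if an event with this mask may be placed in `slot`, else 0. */
static int mask_allows_slot(uint32_t mask, unsigned slot) {
    assert(slot < SLOT_COUNT);
    return (int)((mask >> slot) & 1u);
}

/* Counts how many of the 10 slots the mask permits. */
static unsigned mask_slot_count(uint32_t mask) {
    unsigned count = 0;
    for (unsigned slot = 0; slot < SLOT_COUNT; slot++)
        count += (unsigned)mask_allows_slot(mask, slot);
    return count;
}

/* Returns the lowest permitted slot, or -1 if the mask permits none. */
static int mask_lowest_slot(uint32_t mask) {
    for (unsigned slot = 0; slot < SLOT_COUNT; slot++)
        if (mask_allows_slot(mask, slot))
            return (int)slot;
    return -1;
}
```

Under these assumptions, each fixed mask permits exactly one slot, while the programmable mask permits eight, starting at slot 2. Reading a mask as a set of permitted slots is the piece of the model that counter compatibility depends on.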
### Counter Categories by Mask

| Mask Type | Bit Pattern | Counters | Allocation Rule |
|-----------|-------------|----------|-----------------|
| Fixed Cycle | `0000000001` | 1 | Unique slot (bit 0) |
| Fixed Instruction | `0000000010` | 1 | Unique slot (bit 1) |
| Group M | `0010000000` | 6 | Single slot (bit 7) |
| Group G | `0011100000` | 18 | 3 slots (bits 5-7) |
| General | `1111111100` | 33 | 8 slots (bits 2-9) |

Bits are numbered from 0 at the least significant (rightmost) position.

### Group M Counters (6 counters - incompatible in pairs)

```
INST_ALL
INST_INT_ALU
INST_INT_ST
INST_LDST
INST_SIMD_ALU
RETIRE_UOP
```

### Group G Counters (18 counters - incompatible in quadruples)

```
BRANCH_CALL_INDIR_MISPRED_NONSPEC
BRANCH_COND_MISPRED_NONSPEC
BRANCH_INDIR_MISPRED_NONSPEC
BRANCH_MISPRED_NONSPEC
BRANCH_RET_INDIR_MISPRED_NONSPEC
INST_BARRIER
INST_BRANCH
INST_BRANCH_CALL
INST_BRANCH_COND
INST_BRANCH_INDIR
INST_BRANCH_RET
INST_BRANCH_TAKEN
INST_INT_LD
INST_SIMD_LD
INST_SIMD_ST
L1D_CACHE_MISS_LD_NONSPEC
L1D_CACHE_MISS_ST_NONSPEC
L1D_TLB_MISS_NONSPEC
```

## Counter Allocation Algorithm

### Core Principle

When adding a counter to the monitoring list:

> The counter takes the first available slot allowed by its mask, starting from the lowest bit.

### Why Order Matters

The allocation algorithm processes counters sequentially. A counter with a wide mask may occupy a slot that a later counter with a narrower mask depends on, so the later counter cannot be allocated.

#### Example: Ordering Failure Case

```mermaid
graph LR
    subgraph Initial State
        B0["Slot 0: Empty"]
        B1["Slot 1: Empty"]
        B2["Slot 2: Empty"]
        B3["Slot 3: Empty"]
        B4["Slot 4: Empty"]
        B5["Slot 5: Empty"]
        B6["Slot 6: Empty"]
        B7["Slot 7: Empty"]
        B8["Slot 8: Empty"]
        B9["Slot 9: Empty"]
    end
```

Adding counters in this order fails:

1. `L1D_TLB_ACCESS` (mask `1111111100`) - occupies slot 2
2. `L1D_TLB_MISS` (mask `1111111100`) - occupies slot 3
3. `L1D_CACHE_MISS_ST` (mask `1111111100`) - occupies slot 4
4. `L1D_CACHE_MISS_LD` (mask `1111111100`) - occupies slot 5
5. `LD_UNIT_UOP` (mask `1111111100`) - occupies slot 6
6. `ST_UNIT_UOP` (mask `1111111100`) - occupies slot 7
7. `INST_LDST` (mask `0010000000`) - needs slot 7, but it is already occupied by `ST_UNIT_UOP`

#### Solution: Reorder Counters

Swap `ST_UNIT_UOP` and `INST_LDST`:

- Steps 1-5: same as above (slots 2-6)
- Step 6: `INST_LDST` (mask `0010000000`) - occupies slot 7
- Step 7: `ST_UNIT_UOP` (mask `1111111100`) - occupies slot 8 (skips occupied slot 7)

### Recommended Ordering Strategy

For predictable behavior, add counters in ascending order by mask value, so that the most constrained counters claim their slots first:

1. Fixed counters first (`FIXED_CYCLES`, `FIXED_INSTRUCTIONS`)
2. Group M counters (single-slot, mask `0010000000`)
3. Group G counters (three-slot, mask `0011100000`)
4. General counters (wide mask, `1111111100`)

```mermaid
graph TD
    A[Start] --> B{Add Fixed Counters}
    B --> C{Add Group M Counters}
    C --> D{Add Group G Counters}
    D --> E{Add General Counters}
    E --> F[Complete]
    B -->|Error| G[Allocation Failed]
    C -->|Error| G
    D -->|Error| G
    E -->|Error| G
```

## Implementation Considerations

### kperf API Structures

The `kpep_event` structure contains the critical `mask` field:

```c
/* u8 and u32 are assumed to be fixed-width unsigned typedefs
   (uint8_t, uint32_t) declared elsewhere in the reverse-engineered header. */
typedef struct kpep_event {
    const char *name;
    const char *description;
    const char *errata;
    const char *alias;      // e.g., "Instructions", "Cycles"
    const char *fallback;
    u32 mask;               // Critical for compatibility
    u8 number;
    u8 umask;
    u8 reserved;
    u8 is_fixed;
} kpep_event;
```

### Key Constraints

| Constraint | Value | Notes |
|------------|-------|-------|
| Maximum counters | 10 | Based on 10-bit mask width |
| Fixed counters | 2 | Always available |
| Programmable slots | 8 | Shared among 58 events |
| Privileges required | sudo | kperf requires root access |

## Practical Tools

### Lauka

A custom tool created as a result of this research:

- Forked from the `poop` tool by Andrew Kelley
- Incorporates `kperf` reverse-engineering by ibireme
- Apple Silicon only (M1, M2, and later)
- Features:
  - Select events to monitor
  - Display all available events
  - Warm-up runs
  - Proper counter ordering

### Example Output

```
measurement                  mean ± σ          min … max
wall_time                    591ms ± 7.6ms     583ms … 605ms
peak_rss                     137MB ± 0.3MB     136.6MB … 137.4MB
core_active_cycle            2.51G ± 22.1M     2.48G … 2.54G
inst_all                     3.62G ± 23.9M     2.53G … 3.69G
l1d_cache_miss_ld_nonspec    3.58M ± 31.7K     3.54M … 3.63M
branch_mispred_nonspec       21.4M ± 58.2K     21.3M … 21.5M
```

## Lessons Learned

1. **Research cross-platform**: Linux PMU implementations are better documented and can provide insights applicable to Apple Silicon.
2. **Study reverse-engineered code deeply**: A thorough early analysis of the `kperf` structures would have revealed the mask-based allocation system immediately.
3. **Focus on root causes**: Spending time on combinatorial analysis (18+ million incompatible cases) was less productive than understanding the underlying allocation algorithm.
4. **Order matters everywhere**: Counter ordering affects compatibility even in Apple's own Instruments application.

## References

- Original article: https://blog.bugsiki.dev/posts/apple-pmu/
- Apple CPU Optimization Guide (requires Apple Developer account)
- kperf reverse-engineering: ibireme's work
- poop tool: Andrew Kelley
- Lauka tool: https://github.com/ (link to be added)

Last modified: January 12, 2026. © Reposting permitted with proper attribution.