

# z16 SMF 113s – Understanding Processor Cache Counters



z/OS Performance Education, Software, and Managed Service Providers



Creators of Pivotor®

### **Peter Enrico**

Email: Peter.Enrico@EPStrategies.com

Enterprise Performance Strategies, Inc. 3457-53rd Avenue North, #145 Bradenton, FL 34210 <u>http://www.epstrategies.com</u> http://www.pivotor.com

> Voice: 813-435-2297 Mobile: 941-685-6789



Copyright © by SHARE Association. Except where otherwise noted, this work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 license. http://creativecommons.org/licenses/by-nc-nd/3.0/



# Contact, Copyright, and Trademark Notices

### **Questions?**

Send email to Peter at <u>Peter.Enrico@EPStrategies.com</u>, or visit our website at <u>http://www.epstrategies.com</u> or <u>http://www.pivotor.com</u>.

#### **Copyright Notice:**

© Enterprise Performance Strategies, Inc. All rights reserved. No part of this material may be reproduced, distributed, stored in a retrieval system, transmitted, displayed, published or broadcast in any form or by any means, electronic, mechanical, photocopy, recording, or otherwise, without the prior written permission of Enterprise Performance Strategies. To obtain written permission please contact Enterprise Performance Strategies, Inc. Contact information can be obtained by visiting <a href="http://www.epstrategies.com">http://www.epstrategies.com</a>.

#### **Trademarks:**

Enterprise Performance Strategies, Inc. presentation materials contain trademarks and registered trademarks of several companies.

The following are trademarks of Enterprise Performance Strategies, Inc.: Health Check®, Reductions®, Pivotor®

Other trademarks and registered trademarks may exist in this presentation



### Key Reports to Evaluate z16 Processor Caches

This presentation walks through and explains several reports that are useful when evaluating the primary processor cache measurements on the z16 processor. It reviews the basic concepts and usage of the processor caches, then shows which reports and measurements should be used to assess the effects of processor caches in a z16 environment.



# EPS: We do z/OS performance...

- **Pivotor** z/OS performance reporting and analysis software and services
  - Not just SMF reporting, but analysis-based reporting based on expertise
  - www.pivotor.com
- Education and instruction
  - We teach our z/OS performance workshops all over the world
  - Want a workshop in your area? Just contact me.
- z/OS Performance War Rooms
  - Intense, concentrated, and highly productive on-site performance group discussions, analysis and education
  - Amazing feedback from dozens of past clients
- Information
  - We present around the world and participate in online forums
  - <u>https://www.pivotor.com/content.html</u> <u>https://www.pivotor.com/webinar.html</u>





# z/OS Performance workshops available

### During these workshops you will be analyzing your own data!

- WLM Performance and Re-evaluating Goals
  - February 19-23, 2024
- Parallel Sysplex and z/OS Performance Tuning
  - August 20-21, 2024
- Essential z/OS Performance Tuning
  - October 7-11, 2024
- Also... please make sure you are signed up for our free monthly z/OS educational webinars! (email contact@epstrategies.com)

© Robert Rogers



## Like what you see?

- Free z/OS Performance Educational webinars!
  - The titles for our Summer / Fall 2024 webinars are as follows:
    - ✓ What a z/OS Guy Learned About AWS in 10 Years
    - ✓ Advantages of Multiple Period Service Classes
    - ✓ Understanding z/OS Connect Measurements
    - WLM and SMF 99.1 System Measurements Deeper Dive
    - WLM and SMF 99.2 Service Class Period Measurements Deeper Dive
    - Optimizing Performance at the Speed of Light: Why I/O Avoidance is Even More Important Today
    - Understanding MVS Busy % versus LPAR Busy % versus Physical Busy %
    - Rethinking IBM Software Cost Management Under Tailored Fit Pricing
    - Understanding Page Faults and Their Influence on Uncaptured Time
    - Response Time Goals: Average or Percentiles?
    - Understanding and Using Enclave
- If you want a free cursory review of your environment, let us know!
  - We're always happy to process a day's worth of data and show you the results
  - See also: <u>http://pivotor.com/cursoryReview.html</u>



### Like what you see?

- The z/OS Performance Graphs you see here come from Pivotor
- If you don't see them in your performance reporting tool, or you just want a free cursory performance review of your environment, let us know!
  - We're always happy to process a day's worth of data and show you the results
  - See also: <a href="http://pivotor.com/cursoryReview.html">http://pivotor.com/cursoryReview.html</a>
- We also have a free Pivotor offering available
  - 1 System, SMF 70-72 only, 7 Day retention
  - That still encompasses over 100 reports!




## EPS presentations this week

| What                                                                | Who                           | When      | Where     |
|---------------------------------------------------------------------|-------------------------------|-----------|-----------|
| 60 Years of Pushing Performance Boundaries with the Mainframe       | Scott Chapman                 | Sun 17:00 | Neptune D |
| Introduction to Parallel Sysplex and Data Sharing                   | Peter Enrico                  | Mon 13:15 | Pomona    |
| Macro to Micro: Understanding z/OS Performance Moment by Moment     | Scott Chapman                 | Mon 15:45 | Neptune D |
| WLM Turns 30! : A Retrospective and Lessons Learned                 | Peter Enrico                  | Tue 10:30 | Neptune D |
| PSP: z/OS Performance Spotlight: Some Top Things You May Not Know   | Peter Enrico<br>Scott Chapman | Tue 13:00 | Pomona    |
| More/Slower vs. Fewer/Faster CPUs: Practical Considerations in 2024 | Scott Chapman                 | Tue 14:15 | Neptune D |
| z16 SMF 113s – Understanding Processor Cache Counters               | Peter Enrico                  | Wed 13:15 | Pomona    |



# Why do we care about processor cache measurements and usage of the caching hierarchy?

Instructor: Peter Enrico



• Question:

What are the key influences that result in variations of a particular processor's delivered capacity relative to a customer's environment and workload?

- Answer: As Gary King of IBM would say... there are three key influences:
  - Instruction complexity of one processor family to another
  - Path length of the code executed by customer applications and transactions
  - Usage of the Memory Hierarchy
- A machine's capacity will vary based on each of these three factors





## Key Influence - Instruction Complexity

- Notice the PCI column showing the PCIs of the 701 series of each processor
  - PCI: Processor Capacity Index
  - Term used by IBM instead of MIPS

Max Memory, Max CPs, and PU Chips are per first book/drawer. L1 and L2 are core-level caches, L3 is per chip, and L4 is per book/drawer.

| zGen | Name   | Year | Mach Type | GHz | 701 PCI | 701 MSUs | Max Memory | Max CPs | PU Chips | Cores/chip | L1-Data | L1-Instr | L2-Data   | L2-Instr               | L3/chip    | L4/book-drawer |
|------|--------|------|-----------|-----|---------|----------|------------|---------|----------|------------|---------|----------|-----------|------------------------|------------|----------------|
| z9   | z9 EC  | 2005 | 2094      | 1.7 | 560     | 81       | 128G       | 8       | 8        | 2          | 256K    | 256K     | n/a       | n/a                    | n/a        | 40M            |
| z10  | z10 EC | 2008 | 2097      | 4.4 | 902     | 115      | 384G       | 12      | 5        | 4          | 128K    | 64K      | 3M        | (unified with L2-Data) | n/a        | 48M            |
| z11  | z196   | 2010 | 2817      | 5.2 | 1202    | 150      | 704G       | 15      | 6        | 4          | 128K    | 64K      | 1.5M      | (unified with L2-Data) | 24M        | 192M           |
| z12  | zEC12  | 2012 | 2827      | 5.5 | 1514    | 188      | 704G       | 20      | 6        | 6          | 96K     | 64K      | 1M        | 1M                     | 48M        | 348M           |
| z13  | z13    | 2015 | 2964      | 5   | 1695    | 210      | 2464G      | 30      | 6        | 8          | 128K    | 96K      | 2M        | 2M                     | 64M        | 960M           |
| z14  | z14    | 2017 | 3906      | 5.2 | 1832    | 227      | 8000G      | 33      | 6        | 10         | 128K    | 128K     | 4M        | 2M                     | 128M       | 672M           |
| z15  | z15    | 2019 | 8561      | 5.2 | 2055    | 253      | 8000G      | 34      | 4        | 12         | 128K    | 128K     | 4M        | 4M                     | 256M       | 960M           |
| z16  | z16    | 2022 | 3931      | 5.2 | 2253    | 278      | 9984G      | 39      | 4x2      | 8          | 128K    | 128K     | up to 32M | (unified with L2-Data) | up to 256M | up to 2G       |




# Key Influence – Path Length

- Path length of the code executed by customer applications and transactions
  - This relates to code executed by applications / jobs / transactions / etc.
  - Instruction count
- The actual path lengths executed by a workload will vary
  - From customer to customer, and between IBM synthetic workloads and customer workloads
  - From one customer's application environment versus another application environment of that same customer
    - Example: CICS / DB2 application versus a WAS / DB2 application
- Is sensitive to the configuration due to MP effects
  - Higher n-ways or differences in configuration may increase the path lengths executed (which in turn influences the processor capacity relative to LSPRs)
    - Example: May have more locking in a higher MP environment, or queues may be longer, etc.
- But when moving from one processor to another, this generally does not change much for a specific customer
  - Whether the move is from one processor family to another
  - Or from one processor in the same family to another




# Key Influence – Memory Hierarchy

- Usage of the Memory Hierarchy
  - Heavily influenced by key factors that can result in potentially wide variations in realized capacity
  - From one processor family to another there are many design alternatives
    - Levels of cache, scope of cache, latency, etc.
  - Configuration will influence usage of the memory hierarchy
    - LPAR configuration, competition between LPARs, options such as HiperDispatch, etc.
  - Exploitation by workloads will influence usage of the memory hierarchy
    - Transaction intensity, memory intensity, I/O intensity, application mixtures, competition for resources among applications, etc.
  - z/OS performance management and options
    - WLM management of resources, affinity nodes, IEAOPTxx opts, heap sizes, initiators, etc.
- Final result is that usage of memory hierarchy heavily influences a processor's delivered capacity and performance.
  - Workload performance sensitive to how deep into the memory hierarchy the processor must go to retrieve instructions and data
- So, for processor sizing, LSPRs have started focusing on this




## Case Study CEC LSPRs: z14 vs z16

| zGen | Processor | #CP | PCI** | MSU*** | Low*  | Average* | High* |
|------|-----------|-----|-------|--------|-------|----------|-------|
| z14  | 3906-609  | 9   | 8142  | 997    | 15.99 | 14.55    | 12.79 |
| z16  | 3931-606  | 6   | 8006  | 980    | 14.92 | 14.3     | 13.01 |

| L1MP     | RNI        | Workload Hint |
|----------|------------|---------------|
| <3%      | >= 0.75    | AVERAGE       |
| <3%      | < 0.75     | LOW           |
| 3% to 6% | > 1.0      | HIGH          |
| 3% to 6% | 0.6 to 1.0 | AVERAGE       |
| 3% to 6% | < 0.6      | LOW           |
| >6%      | >= 0.75    | HIGH          |
| >6%      | < 0.75     | AVERAGE       |
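The L1MP / RNI table above is a small decision rule, so it can be sketched as a function. This is a minimal illustration, not an IBM or Pivotor implementation: the function name `workload_hint` is made up here, and the treatment of values that fall exactly on 3% or 6% is an assumption, since the table does not say which band owns the boundary.

```python
def workload_hint(l1mp: float, rni: float) -> str:
    """Return the workload hint (LOW, AVERAGE, HIGH) for an L1MP / RNI pair.

    l1mp: L1 misses per 100 instructions (a percentage).
    rni:  Relative Nest Intensity.
    Assumption: exactly 3% and exactly 6% are treated as the "3% to 6%" band.
    """
    if l1mp < 3.0:
        return "AVERAGE" if rni >= 0.75 else "LOW"
    if l1mp <= 6.0:
        if rni > 1.0:
            return "HIGH"
        return "AVERAGE" if rni >= 0.6 else "LOW"
    return "HIGH" if rni >= 0.75 else "AVERAGE"


# Illustrative (made-up) measurements, one per L1MP band.
for l1mp, rni in [(2.1, 0.80), (4.5, 0.50), (7.2, 0.90)]:
    print(f"L1MP={l1mp}%  RNI={rni}  ->  {workload_hint(l1mp, rni)}")
```

The hint then selects which LSPR column (Low, Average, or High) to read for the processor being sized.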




# Introducing the Processor Caches of IBM's zArchitecture Processors


# Modern Performance Optimization

- Distance matters!
- Keep data close to not just the processor, but close to the instruction units on the processor
  - i.e., L1 cache hits are very important
  - If the data isn't in L1, hopefully it's in L2, L3 or L4
- Hardware Instrumentation Services (HIS) records processor efficiency metrics in SMF 113 records
  - Be sure to record these
- SMF 99.14 records capture the mapping of logical to physical cores
  - Of particular interest for multi-book machines to make sure LPARs aren't crossing books
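Once the logical-to-physical mapping has been extracted, the "is an LPAR crossing books?" check is a simple grouping exercise. The sketch below assumes the mapping has already been parsed into `(LPAR, logical core, drawer)` tuples. That tuple shape and the sample values are hypothetical and are not the SMF 99.14 record layout.

```python
from collections import defaultdict

# Hypothetical, already-parsed topology rows: (LPAR name, logical core id, drawer number).
# Illustrative values only; this is NOT the SMF 99.14 record layout.
topology = [
    ("SYS1", 0, 1), ("SYS1", 1, 1), ("SYS1", 2, 1),
    ("SYS2", 0, 1), ("SYS2", 1, 2), ("SYS2", 2, 2),
]


def lpars_spanning_drawers(rows):
    """Return {lpar: sorted drawer list} for LPARs whose cores sit in more than one drawer."""
    drawers = defaultdict(set)
    for lpar, _core, drawer in rows:
        drawers[lpar].add(drawer)
    return {lpar: sorted(d) for lpar, d in drawers.items() if len(d) > 1}


print(lpars_spanning_drawers(topology))
```

With the sample rows, only SYS2 is reported, because its cores are spread over drawers 1 and 2.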






# z15 Cache Summary


## z15 CPU Core

- All processor types (GCP, zIIP, ICF, IFL, SAP, IFP) use the exact same physical core
  - Microcode limits what the individual logical CPs can do
- Note how much area is given to L2 cache
  - 4MB instr, 4MB data
- L1 is in LSU
- 5.2 GHz (0.192ns cycle time)
  - ~5,200,000 cycles in 1ms
  - Light travels only ~2 inches in one cycle

Core unit abbreviations:

- ICM: Instr Cache/Merge
- IDU: Instr Decode
- IFB: Instr fetch/branch predict
- ISU: Instr Sequence
- LSU: Load/Store
- XU: TLB/DAT
- FXU: Fixed point
- VFU: Vector / FP
- MA: Modulo Arithmetic for Elliptic Curve Crypto
- RU: Recovery Unit
- COP: CoProc
- PC+TP: HIS / error collection, trap
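The cycle-time figures on this slide follow directly from the 5.2 GHz clock. The short calculation below reproduces them. The only input not taken from the slide is the speed of light, and the variable names are illustrative.

```python
CLOCK_GHZ = 5.2                         # z15 clock frequency from the slide
SPEED_OF_LIGHT_M_PER_S = 299_792_458    # defined physical constant (not from the slide)
METERS_PER_INCH = 0.0254

cycle_time_ns = 1.0 / CLOCK_GHZ                       # nanoseconds per cycle
cycles_per_ms = CLOCK_GHZ * 1e9 * 1e-3                # cycles in one millisecond
light_inches_per_cycle = (
    SPEED_OF_LIGHT_M_PER_S * cycle_time_ns * 1e-9 / METERS_PER_INCH
)

print(f"cycle time: {cycle_time_ns:.3f} ns")
print(f"cycles per ms: {cycles_per_ms:,.0f}")
print(f"light travels about {light_inches_per_cycle:.1f} inches per cycle")
```

The result is consistent with the "~2 inches" rule of thumb: in one cycle, a signal cannot physically travel much farther than the width of a couple of chips, which is why distance matters.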



© Enterprise Performance Strategies





## z15 Processor Unit (PU) Chip

- This is one z15 PU (Processor Unit) Chip
  - About 1" square (25.3mmx27.5mm)
  - 9.2B transistors
- 4 chips per drawer
- 12 cores (9, 10, or 11 "active") per chip
  - 41 active cores per drawer on models below Max190
  - 43 active cores per drawer on the Max190
  - Wafer yields improved by utilizing chips that have some cores disabled
- Notice amount of chip area for L3 cache
  - Note cores rotated to orient L2 near L3
  - Distance matters!
- Note NXU: Nest Acceleration Unit






# z15 System Controller (SC) Chip

- This is one z15 SC (System Controller) Chip
  - One chip per drawer
- Provides L4 cache
  - 960 MB of cache per SC chip
- Manages communications between PU chips (X-Bus) and drawers (A-Bus)



© IBM




# z16 Cache Summary

Content for the next few z16 slides is from various IBM presentations


### z16 Virtual Caches (slide source: IBM)

- What's different from z15
  - There is no L3 physical cache present on the cores
    - There is a new L1 Shadow Cache that will help manage syncing lines with L2
  - There is no SC chip or physical L4 Cache
    - All CP L2 caches are interconnected via buses
- How Virtual Caches work
  - L2 Caches of unused cores or underutilized cores will be converted to be used as virtual caches
    - If the core becomes active the cache will be returned
  - Virtual cache on the same CP will be seen as additional virtual L3 cache to the core
  - Virtual Cache on a different CP on the same drawer will be seen as L4 Cache
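The "how virtual caches work" rules above can be expressed as a toy lookup: given where the requesting core sits and whose L2 holds the line, name the logical level the requester sees. This is only a sketch of the rules as stated on the slide. `CoreLocation` and `sourcing_level` are hypothetical names, and treating a line held in a different drawer as "remote" L4 is an assumption, not something the slide states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreLocation:
    """Physical position of a core and its L2. Hypothetical model, not an IBM structure."""
    drawer: int
    chip: int
    core: int


def sourcing_level(requester: CoreLocation, holder: CoreLocation) -> str:
    """Name the logical cache level the requester sees for a line held in holder's L2."""
    if holder == requester:
        return "L2"                      # the core's own private L2
    if (holder.drawer, holder.chip) == (requester.drawer, requester.chip):
        return "virtual L3"              # another core's L2 on the same chip
    if holder.drawer == requester.drawer:
        return "virtual L4 (local)"      # a different chip in the same drawer
    # Assumption: a line held in another drawer corresponds to remote L4.
    return "virtual L4 (remote)"


me = CoreLocation(drawer=0, chip=1, core=3)
for holder in (me, CoreLocation(0, 1, 5), CoreLocation(0, 2, 0), CoreLocation(1, 0, 0)):
    print(holder, "->", sourcing_level(me, holder))
```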






### z16 – Telum Processor Chip (slide source: IBM)

- Samsung 7nm FinFET Technology
  - 530 mm² chip (25% smaller than
  - Space for future system growth
- 8 Core Processor Chip
  - 4th Generation SMT Core design
  - On Chip Accelerators: AI, Compression, Coupling, Sort
  - Gen 4 PCIe Interface
  - Memory Interface (up to 2TB per chip, encryption +
- Core L1 Cache
  - Private 128KB L1 I and 128KB L1 D
- Core / Chip L2 Cache
  - 32MB Unified L2 cache shared by I & D (4x capacity, 19
  - 4 independent pipelines for fetch/store traffic (320GB/
  - Precise tracking of L1 content (reduces MP
- Scalable high speed, low latency ring interconnect (320GB/



#### **Telum Processor Chip**



### z16 – Virtual L3 and L4 Caches (slide source: IBM)



- Virtual Cache Layers built from L2 Cache blocks
  - Mirrors physical hierarchy of prior designs
- Base Virtual Cache design behavior
  - Private L2 Cache when processor is active (dynamic, 16MB)
  - Virtual L3 Cache when processor is inactive (victim)
  - Virtual L4 Cache when processor chip is inactive (victim)
- Scales as additional processors are brought online
  - L2 Caches switch from Victim L3 to Private L2 behavior
  - Similar to Cache Inclusivity tax of prior designs
- Logical Hierarchy remains
  - 1.5x more cache per core at vL3, vL4
  - More efficient use of cache array space
  - Overcomes limits of traditional architecture
  - Extendable for future generations



#### Logical View



- L1: 128KB IL1, 128KB DL1 (8w IL1, 8w DL1)
- L2: 32MB Shared-Victim, 16w Set Associative
- L3: 256MB Virtual Victim Cache, 128w Set Associative
- L4: 2GB Virtual Victim Cache, 1024w Set Associative
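The virtual L3 and L4 sizes in the logical view are consistent with simply adding up L2 blocks. The arithmetic below uses the 32MB L2 per core and 8 cores per chip from the Telum slide, plus the "4x2" PU chips per drawer from the earlier processor table. Reading "4x2" as 8 chips per drawer is an interpretation, and these are the "up to" maximums, since an active core keeps its L2 private.

```python
L2_MB_PER_CORE = 32        # unified L2 per core (Telum slide)
CORES_PER_CHIP = 8         # cores per Telum chip (Telum slide)
CHIPS_PER_DRAWER = 4 * 2   # "4x2" PU chips per drawer (earlier table), read as 8 chips

virtual_l3_mb = L2_MB_PER_CORE * CORES_PER_CHIP      # all L2 blocks on one chip
virtual_l4_mb = virtual_l3_mb * CHIPS_PER_DRAWER     # all L2 blocks in one drawer
virtual_l4_gb = virtual_l4_mb / 1024

print(f"virtual L3 per chip:   up to {virtual_l3_mb} MB")    # 256 MB
print(f"virtual L4 per drawer: up to {virtual_l4_gb:g} GB")  # 2 GB
```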


# z16 Virtual Cache Provisioning



• One chip example (just to make the point)






IBM z16 Technical Overview\_21

© 2022 IBM Corporation



# z16 Cache Reporting

Content for the next few z16 slides is from various IBM presentations


# z16 Processor Logical Cache Hierarchy



 Although virtualized, measurements are calculated for virtual L3 and L4 caches

#### • Distance Matters!

- Register
- Processor Caches
  - L1 Cache
  - L2 Cache
  - L3 Cache
  - L4 Cache
    - Local
    - Remote
- Real Memory (i.e., Central Storage)
- Hardware Instrumentation Services (HIS) records processor efficiency metrics in SMF 113 records






## Case Study CEC LSPRs: z14 vs z16

- z14 (3906-609 M02)
- z16 (3931-606 A01)

| zGen | Processor | #CP | PCI** | MSU*** | Low*  | Average* | High* |
|------|-----------|-----|-------|--------|-------|----------|-------|
| z14  | 3906-609  | 9   | 8142  | 997    | 15.99 | 14.55    | 12.79 |
| z16  | 3931-606  | 6   | 8006  | 980    | 14.92 | 14.3     | 13.01 |
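One way to read this comparison is capacity per CP: the z16 configuration delivers nearly the same PCI with a third fewer engines. The sketch below simply divides the PCI values in the table by the CP counts. It is a rough average that folds MP effects into the per-CP figure, not an LSPR-sanctioned metric, and the dictionary layout is only for illustration.

```python
# PCI and CP counts copied from the case study table.
configs = {
    "z14 3906-609": {"cps": 9, "pci": 8142},
    "z16 3931-606": {"cps": 6, "pci": 8006},
}

# Average PCI delivered per CP (MP effects are folded into this average).
pci_per_cp = {name: c["pci"] / c["cps"] for name, c in configs.items()}
for name, value in pci_per_cp.items():
    print(f"{name}: {value:.0f} PCI per CP")

ratio = pci_per_cp["z16 3931-606"] / pci_per_cp["z14 3906-609"]
print(f"z16 CP vs z14 CP in this case study: {ratio:.2f}x")
```

Fewer, faster engines make each engine's cache behavior more important, which is why the reports that follow compare the two machines counter by counter.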




## z14 vs z16 SYS2 config

- z14 (3906-609 M02)
  - 9 CPs, 4 zIIPs
  - SYS2: 4 CPs, 4 zIIPs



- z16 (3931-606 A01)
  - 6 CPs, 4 zIIPs
  - SYS2: 4 CPs, 4 zIIPs






## z14 vs z16 – Cache Sourcing

#### Notice the improved sourcing from L2 since L2 caches are much larger
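A cache sourcing view answers one question: of all L1 misses, what share was satisfied from each level? A minimal sketch of that calculation follows. The counts are made-up illustrative values, not measurements from the z14 or z16 in this case study, and the level names are just dictionary keys.

```python
def sourcing_percentages(counts):
    """Convert per-level L1-miss sourcing counts into percentages of all L1 misses."""
    total = sum(counts.values())
    if total == 0:
        return {level: 0.0 for level in counts}
    return {level: 100.0 * n / total for level, n in counts.items()}


# Hypothetical counts of L1 misses by the level that supplied the line.
l1_miss_sources = {
    "L2": 8_500,
    "L3": 1_000,
    "L4 local": 300,
    "L4 remote": 50,
    "Memory": 150,
}

for level, pct in sourcing_percentages(l1_miss_sources).items():
    print(f"{level:10s} {pct:5.1f}%")
```

A larger L2 should show up as a higher L2 percentage and correspondingly lower percentages for the levels behind it.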




# There are 4 key cache values (Pivotor Example)




4 key calculated values (CP CPU example):

- CPI: Cycles per Instruction
  - Used to gauge processor contention, as well as instruction mix consistency


- L1MP: L1 misses per 100 instructions
  - Think of this as a 'miss percentage'
- RNI: Relative Nest Intensity
  - Workload 'signature' that gauges the pressures being put on the upper caches and memory
- TLB Miss CPU Percentage
  - Total percent of the CPU consumed by the LPAR that goes to dynamic address translation (DAT) due to a translation look-aside buffer miss
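Three of these four values are plain ratios and can be sketched straight from the definitions above. The function and argument names are illustrative, the inputs are assumed to be interval totals already pulled from the SMF 113 counters, and the sample numbers are made up. RNI is left out on purpose because its weighting factors are machine-specific and are not given in this presentation.

```python
def cpi(cycles: float, instructions: float) -> float:
    """Cycles per instruction."""
    return cycles / instructions


def l1mp(l1_misses: float, instructions: float) -> float:
    """L1 misses per 100 instructions (a 'miss percentage')."""
    return 100.0 * l1_misses / instructions


def tlb_miss_cpu_pct(tlb_miss_cycles: float, cycles: float) -> float:
    """Percent of consumed CPU cycles spent in DAT because of TLB misses."""
    return 100.0 * tlb_miss_cycles / cycles


# Hypothetical interval totals for one LPAR's CP engines.
cycles = 3_000_000_000
instructions = 1_000_000_000
l1_misses = 25_000_000
tlb_miss_cycles = 60_000_000

print(f"CPI:            {cpi(cycles, instructions):.2f}")
print(f"L1MP:           {l1mp(l1_misses, instructions):.2f}%")
print(f"TLB Miss CPU %: {tlb_miss_cpu_pct(tlb_miss_cycles, cycles):.2f}%")
```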

# RNI – Breakdown by Cache










### L1MP – z14 vs z16 (L1 Misses per 100 Instr)

This example shows little difference in L1 misses per 100 instructions. As a reminder, L1MP is not a performance metric to be tuned, but rather a 'signature' of a customer's workloads relative to the LSPRs.






## CPI – z14 vs z16 (Cycles per Instruction)

This example shows about a 1 cycle per instruction improvement, probably mostly due to the larger L2 cache.




### TLB Miss CPU % - z14 vs z16



This example shows TLB Miss CPU % to be about the same, at about 2%. This represents pure CPU consumption of Dynamic Address Translation (DAT) when there is a miss in the Translation Lookaside Buffer (TLB).






### Summary

- The purpose of this presentation was to show some useful caching reports for the z16 processor
- A great deal still needs to be learned
- In addition, the following still needs to be understood, measured, and explained:
  - Current metrics are measuring the physical cache structure of processors prior to the z16
  - Is it possible to measure the cost of virtualization of the caches?
    - Example: cost of L4 on chip vs L4 off chip
    - Example: cost of provisioning the underutilized or inactive L2 to a core's L3 or L4




# Comments from Jamie... and then Q & A

Questions about the content of the webinar?

Or maybe general performance questions?
