Posted: Sat May 11, 2019 8:46 am
by LordCrimson
We've been seeing multiple HPMC panics lately.

Code:
Kernel panic - not syncing High Priority Machine Check (HPMC)
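
To put a number on "multiple", one option is to count the panic lines in a saved console log. This is only a rough sketch: the function name count_hpmc_panics is made up for illustration, and it assumes the console output has been captured to a plain text file that you pass as the argument.

```shell
# Count HPMC panic lines in a saved console log.
# count_hpmc_panics is a hypothetical helper; the log file is whatever
# capture of the serial console you happen to have.
count_hpmc_panics() {
    # -F: match the string literally, -c: print the number of matching lines.
    # grep exits non-zero when nothing matches, so "|| true" keeps the
    # function usable in scripts running under "set -e".
    grep -cF 'High Priority Machine Check (HPMC)' "$1" || true
}
```

Call it as count_hpmc_panics followed by the path of the captured log.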

So this got me wondering how the kernel detects a bus check HPMC - what is the HPMC handler watching for?

A hardware crash event can be a High Priority Machine Check (HPMC), a Low Priority Machine Check (LPMC) or a Transfer of Control (TOC). The machine checks are typically caused by hardware malfunctions or certain classes of bus errors. TOC, on the other hand, is usually initiated by the operator in response to system software being stuck in an error state.

When a hardware crash event occurs, the processor immediately branches to a PDC entry point: PDCE_CHECK for HPMC and LPMC faults, and PDCE_TOC for TOC. *The implementation details of these PDC entry points are processor dependent.* Fundamentally, they save the processor's state (general, control, space and interruption registers) into Processor Internal Memory (PIM). The processor then vectors back into the operating system entry points, HPMC_Vector or TOC_Vector. These entry points are defined in the IVA (Interruption Vector Table) and in MEM_TOC in Page Zero, respectively.
On entry into the kernel, a crash event entry is created. The operating system makes a PDC call (PDC_PIM) to read the processor's state information from PIM into a Restart Parameter Block (RPB). As such, the RPB structure contains information pertinent to understanding the crash. For example, the Program Counter (PC) in the RPB indicates what routine was executing at the time of the HPMC/TOC event. Once the state has been saved, the operating system continues to dump physical memory to the dump device.
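
As an illustration of how a saved PC can be turned into a routine name on a Linux box, here is a minimal sketch that looks the address up in a System.map-style symbol table. The function name lookup_symbol is made up, and the sketch assumes the map is sorted by address, that its addresses are lowercase hex without a 0x prefix, and that the PC is given in the same width and format. It is not how the kernel or the firmware does it, just a quick manual check.

```shell
# Find the symbol whose start address is the closest one at or below a PC.
# lookup_symbol is a hypothetical helper.
# Assumptions: the map has "address type name" lines sorted by address,
# and the PC is lowercase hex of the same width, without "0x".
lookup_symbol() {
    pc=$1
    map=$2
    awk -v pc="$pc" '
        # Appending "" forces a string comparison; equal-width lowercase
        # hex strings sort in the same order as their numeric values.
        ($1 "") <= (pc "") { sym = $3 }
        END { if (sym != "") print sym }
    ' "$map"
}
```

Usage would be lookup_symbol with the PC and the path of the System.map matching the running kernel. If the PC lies below the first symbol, nothing is printed.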

hardware triggered panic

Posted: Sat May 11, 2019 8:56 am
by madame
On the PA-RISC architecture, the High Priority Machine Check is the most serious interrupt signal that exists. It is triggered when the hardware detects an illegal operation whose root cause may be:
  • a timing problem with the PCI interface
  • a driver that accesses an address it should not, which causes the HPMC.
When an error is detected, the affected piece of hardware asserts a signal that triggers a "Group 1 interrupt" on the CPU(s), and the firmware writes a copy of the registers into NVRAM. On the next reboot of the machine, you can recover the saved HPMC data from the firmware prompt.

a copy of the registers is saved into NVRAM

Posted: Sat May 11, 2019 9:29 am
by madame
The command ser pim at the firmware prompt will usually bring up the info needed.

Code:
Main Menu: Enter command > ser pim

HPMC PIM Analysis Information:

After having captured the output, you can issue a ser clearpim command to clear the saved data.

Code:
Main Menu: Enter command > ser clearpim

Clearing PIM...

This will ensure that you do not accidentally pick up stale data the next time around.

in doubt, ram check

Posted: Sun May 12, 2019 2:30 pm
by madame
8 GByte, 8 modules × 1 GByte each

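Code:
# Create a RAM-backed filesystem slightly smaller than the 8 GByte installed
mount -t tmpfs -o size=7G tmpfs /mnt/ramdrive/
# Fill it with one large file; dd stops with "No space left on device" once the tmpfs is full
dd if=/dev/zero of=/mnt/ramdrive/test.bin
# Destructive read-write pattern test (-w) on that file, with progress (-s) and verbose output (-v)
badblocks -swv /mnt/ramdrive/test.bin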

Code:
Checking for bad blocks in read-write mode
From block 0 to 7340031
Testing with pattern 0xaa: done
Reading and comparing: done
Testing with pattern 0x55: done
Reading and comparing: done
Testing with pattern 0xff: done
Reading and comparing: done
Testing with pattern 0x00: done
Reading and comparing: done
Pass completed, 0 bad blocks found.
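
As a sanity check that the test really covered the whole 7G file, the "From block 0 to 7340031" line can be reproduced with a bit of shell arithmetic. This is a small sketch; the function name last_block_index is made up, and it assumes badblocks ran with its default block size of 1024 bytes (no -b option was given above).

```shell
# Last block index badblocks should report for a file of N GiB,
# assuming the default badblocks block size of 1024 bytes.
# last_block_index is a hypothetical helper name.
last_block_index() {
    gib=$1
    # N GiB = N * 1024 * 1024 blocks of 1 KiB; indices start at 0
    echo $(( gib * 1024 * 1024 - 1 ))
}

last_block_index 7   # prints 7340031
```

The result matches the range in the output above, so all of the tmpfs file was tested.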