| CVE |
Vendors |
Products |
Updated |
CVSS v3.1 |
| In the Linux kernel, the following vulnerability has been resolved:
Revert "wireguard: device: enable threaded NAPI"
This reverts commit 933466fc50a8e4eb167acbd0d8ec96a078462e9c which is
commit db9ae3b6b43c79b1ba87eea849fd65efa05b4b2e upstream.
We have had three independent production user reports in combination
with Cilium utilizing WireGuard as encryption underneath that k8s Pod
E/W traffic to certain peer nodes fully stalled. The situation appears
as follows:
- Occurs very rarely but at random times under heavy networking load.
- Once the issue triggers the decryption side stops working completely
for that WireGuard peer, other peers keep working fine. The stall
happens also for newly initiated connections towards that particular
WireGuard peer.
- Only the decryption side is affected, never the encryption side.
- Once it triggers, it never recovers and remains in this state,
the CPU/mem on that node looks normal, no leak, busy loop or crash.
- bpftrace on the affected system shows that wg_prev_queue_enqueue
fails, thus the MAX_QUEUED_PACKETS (1024 skbs!) for the peer's
rx_queue is reached.
- Also, bpftrace shows that wg_packet_rx_poll for that peer is never
called again after reaching this state for that peer. For other
peers wg_packet_rx_poll does get called normally.
- Commit db9ae3b ("wireguard: device: enable threaded NAPI")
switched WireGuard to threaded NAPI by default. The default has
not been changed for triggering the issue, neither did CPU
hotplugging occur (i.e. 5bd8de2 ("wireguard: queueing: always
return valid online CPU in wg_cpumask_choose_online()")).
- The issue has been observed with stable kernels of v5.15 as well as
v6.1. It was reported to us that v5.10 stable is working fine, and
no report on v6.6 stable either (somewhat related discussion in [0]
though).
- In the WireGuard driver the only material difference between v5.10
stable and v5.15 stable is the switch to threaded NAPI by default.
[0] https://lore.kernel.org/netdev/CA+wXwBTT74RErDGAnj98PqS=wvdh8eM1pi4q6tTdExtjnokKqA@mail.gmail.com/
Breakdown of the problem:
1) skbs arriving for decryption are enqueued to the peer->rx_queue in
wg_packet_consume_data via wg_queue_enqueue_per_device_and_peer.
2) The latter only moves the skb into the MPSC peer queue if it does
not surpass MAX_QUEUED_PACKETS (1024) which is kept track in an
atomic counter via wg_prev_queue_enqueue.
3) In case enqueueing was successful, the skb is also queued up
in the device queue, round-robin picks a next online CPU, and
schedules the decryption worker.
4) The wg_packet_decrypt_worker, once scheduled, picks these up
from the queue, decrypts the packets and once done calls into
wg_queue_enqueue_per_peer_rx.
5) The latter updates the state to PACKET_STATE_CRYPTED on success
and calls napi_schedule on the per peer->napi instance.
6) NAPI then polls via wg_packet_rx_poll. wg_prev_queue_peek checks
on the peer->rx_queue. It will wg_prev_queue_dequeue if the
queue->peeked skb was not cached yet, or just return the latter
otherwise. (wg_prev_queue_drop_peeked later clears the cache.)
7) From an ordering perspective, the peer->rx_queue has skbs in order
while the device queue with the per-CPU worker threads from a
global ordering PoV can finish the decryption and signal the skb
PACKET_STATE_CRYPTED out of order.
8) A situation can be observed that the first packet coming in will
be stuck waiting for the decryption worker to be scheduled for
a longer time when the system is under pressure.
9) While this is the case, the other CPUs in the meantime finish
decryption and call into napi_schedule.
10) Now in wg_packet_rx_poll it picks up the first in-order skb
from the peer->rx_queue and sees that its state is still
PACKET_STATE_UNCRYPTED. The NAPI poll routine then exits e
---truncated--- |
| A flaw was found in libcap. A local unprivileged user can exploit a Time-of-check-to-time-of-use (TOCTOU) race condition in the `cap_set_file()` function. This allows an attacker with write access to a parent directory to redirect file capability updates to an attacker-controlled file. By doing so, capabilities can be injected into or stripped from unintended executables, leading to privilege escalation. |
| In the Linux kernel, the following vulnerability has been resolved:
drm/msm/dpu: fix mismatch between power and frequency
During DPU runtime suspend, calling dev_pm_opp_set_rate(dev, 0) drops
the MMCX rail to MIN_SVS while the core clock frequency remains at its
original (highest) rate. When runtime resume re-enables the clock, this
may result in a mismatch between the rail voltage and the clock rate.
For example, in the DPU bind path, the sequence could be:
cpu0: dev_sync_state -> rpmhpd_sync_state
cpu1: dpu_kms_hw_init
timeline 0 ------------------------------------------------> t
After rpmhpd_sync_state, the voltage performance is no longer guaranteed
to stay at the highest level. During dpu_kms_hw_init, calling
dev_pm_opp_set_rate(dev, 0) drops the voltage, causing the MMCX rail to
fall to MIN_SVS while the core clock is still at its maximum frequency.
When the power is re-enabled, only the clock is enabled, leading to a
situation where the MMCX rail is at MIN_SVS but the core clock is at its
highest rate. In this state, the rail cannot sustain the clock rate,
which may cause instability or system crash.
Remove the call to dev_pm_opp_set_rate(dev, 0) from dpu_runtime_suspend
to ensure the correct vote is restored when DPU resumes.
Patchwork: https://patchwork.freedesktop.org/patch/710077/ |
| In the Linux kernel, the following vulnerability has been resolved:
net: psp: check for device unregister when creating assoc
psp_assoc_device_get_locked() obtains a psp_dev reference via
psp_dev_get_for_sock() (which uses psp_dev_tryget() under RCU);
it then acquires psd->lock and drops the reference. Before
the lock is taken, psp_dev_unregister() can run to completion:
take psd->lock, clear out state, unlock, drop the registration
reference.
The expectation is that the lock prevents device unregistration,
but much like with netdevs special care has to be taken when
"upgrading" a reference to a locked device. Add the missing
check if device is still alive. psp_dev_is_registered() exists
already but had no callers, which makes me wonder if I either
forgot to add this or lost the check during refactoring... |
| In the Linux kernel, the following vulnerability has been resolved:
PCI: tegra194: Fix CBB timeout caused by DBI access before core power-on
When PERST# is deasserted twice (assert -> deassert -> assert -> deassert),
a CBB (Control Backbone) timeout occurs at DBI register offset 0x8bc
(PCIE_MISC_CONTROL_1_OFF). This happens because pci_epc_deinit_notify()
and dw_pcie_ep_cleanup() are called before reset_control_deassert() powers
on the controller core.
The call chain that causes the timeout:
pex_ep_event_pex_rst_deassert()
pci_epc_deinit_notify()
pci_epf_test_epc_deinit()
pci_epf_test_clear_bar()
pci_epc_clear_bar()
dw_pcie_ep_clear_bar()
__dw_pcie_ep_reset_bar()
dw_pcie_dbi_ro_wr_en() <- Accesses 0x8bc DBI register
reset_control_deassert(pcie->core_rst) <- Core powered on HERE
The DBI registers, including PCIE_MISC_CONTROL_1_OFF (0x8bc), are only
accessible after the controller core is powered on via
reset_control_deassert(pcie->core_rst). Accessing them before this point
results in a CBB timeout because the hardware is not yet operational.
Fix this by moving pci_epc_deinit_notify() and dw_pcie_ep_cleanup() to
after reset_control_deassert(pcie->core_rst), ensuring the controller is
fully powered on before any DBI register accesses occur. |
| In the Linux kernel, the following vulnerability has been resolved:
powerpc/pgtable-frag: Fix bad page state in pte_frag_destroy
powerpc uses pt_frag_refcount as a reference counter for tracking it's
pte and pmd page table fragments. For PTE table, in case of Hash with
64K pagesize, we have 16 fragments of 4K size in one 64K page.
Patch series [1] "mm: free retracted page table by RCU"
added pte_free_defer() to defer the freeing of PTE tables when
retract_page_tables() is called for madvise MADV_COLLAPSE on shmem
range.
[1]: https://lore.kernel.org/all/7cd843a9-aa80-14f-5eb2-33427363c20@google.com/
pte_free_defer() sets the active flag on the corresponding fragment's
folio & calls pte_fragment_free(), which reduces the pt_frag_refcount.
When pt_frag_refcount reaches 0 (no active fragment using the folio), it
checks if the folio active flag is set, if set, it calls call_rcu to
free the folio, it the active flag is unset then it calls pte_free_now().
Now, this can lead to following problem in a corner case...
[ 265.351553][ T183] BUG: Bad page state in process a.out pfn:20d62
[ 265.353555][ T183] page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x20d62
[ 265.355457][ T183] flags: 0x3ffff800000100(active|node=0|zone=0|lastcpupid=0x7ffff)
[ 265.358719][ T183] raw: 003ffff800000100 0000000000000000 5deadbeef0000122 0000000000000000
[ 265.360177][ T183] raw: 0000000000000000 c0000000119caf58 00000000ffffffff 0000000000000000
[ 265.361438][ T183] page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
[ 265.362572][ T183] Modules linked in:
[ 265.364622][ T183] CPU: 0 UID: 0 PID: 183 Comm: a.out Not tainted 6.18.0-rc3-00141-g1ddeaaace7ff-dirty #53 VOLUNTARY
[ 265.364785][ T183] Hardware name: IBM pSeries (emulated by qemu) POWER10 (architected) 0x801200 0xf000006 of:SLOF,git-ee03ae pSeries
[ 265.364908][ T183] Call Trace:
[ 265.364955][ T183] [c000000011e6f7c0] [c000000001cfaa18] dump_stack_lvl+0x130/0x148 (unreliable)
[ 265.365202][ T183] [c000000011e6f7f0] [c000000000794758] bad_page+0xb4/0x1c8
[ 265.365384][ T183] [c000000011e6f890] [c00000000079c020] __free_frozen_pages+0x838/0xd08
[ 265.365554][ T183] [c000000011e6f980] [c0000000000a70ac] pte_frag_destroy+0x298/0x310
[ 265.365729][ T183] [c000000011e6fa30] [c0000000000aa764] arch_exit_mmap+0x34/0x218
[ 265.365912][ T183] [c000000011e6fa80] [c000000000751698] exit_mmap+0xb8/0x820
[ 265.366080][ T183] [c000000011e6fc30] [c0000000001b1258] __mmput+0x98/0x300
[ 265.366244][ T183] [c000000011e6fc80] [c0000000001c81f8] do_exit+0x470/0x1508
[ 265.366421][ T183] [c000000011e6fd70] [c0000000001c95e4] do_group_exit+0x88/0x148
[ 265.366602][ T183] [c000000011e6fdc0] [c0000000001c96ec] pid_child_should_wake+0x0/0x178
[ 265.366780][ T183] [c000000011e6fdf0] [c00000000003a270] system_call_exception+0x1b0/0x4e0
[ 265.366958][ T183] [c000000011e6fe50] [c00000000000d05c] system_call_vectored_common+0x15c/0x2ec
The bad page state error occurs when such a folio gets freed (with
active flag set), from do_exit() path in parallel.
... this can happen when the pte fragment was allocated from this folio,
but when all the fragments get freed, the pte_frag_refcount still had some
unused fragments. Now, if this process exits, with such folio as it's cached
pte_frag in mm->context, then during pte_frag_destroy(), we simply call
pagetable_dtor() and pagetable_free(), meaning it doesn't clear the
active flag. This, can lead to the above bug. Since we are anyway in
do_exit() path, then if the refcount is 0, then I guess it should be
ok to simply clear the folio active flag before calling pagetable_dtor()
& pagetable_free(). |
| In the Linux kernel, the following vulnerability has been resolved:
ice: fix race condition in TX timestamp ring cleanup
Fix a race condition between ice_free_tx_tstamp_ring() and ice_tx_map()
that can cause a NULL pointer dereference.
ice_free_tx_tstamp_ring currently clears the ICE_TX_FLAGS_TXTIME flag
after NULLing the tstamp_ring. This could allow a concurrent ice_tx_map
call on another CPU to dereference the tstamp_ring, which could lead to
a NULL pointer dereference.
CPU A:ice_free_tx_tstamp_ring() | CPU B:ice_tx_map()
--------------------------------|---------------------------------
tx_ring->tstamp_ring = NULL |
| ice_is_txtime_cfg() -> true
| tstamp_ring = tx_ring->tstamp_ring
| tstamp_ring->count // NULL deref!
flags &= ~ICE_TX_FLAGS_TXTIME |
Fix by:
1. Reordering ice_free_tx_tstamp_ring() to clear the flag before
NULLing the pointer, with smp_wmb() to ensure proper ordering.
2. Adding smp_rmb() in ice_tx_map() after the flag check to order the
flag read before the pointer read, using READ_ONCE() for the
pointer, and adding a NULL check as a safety net.
3. Converting tx_ring->flags from u8 to DECLARE_BITMAP() and using
atomic bitops (set_bit(), clear_bit(), test_bit()) for all flag
operations throughout the driver:
- ICE_TX_RING_FLAGS_XDP
- ICE_TX_RING_FLAGS_VLAN_L2TAG1
- ICE_TX_RING_FLAGS_VLAN_L2TAG2
- ICE_TX_RING_FLAGS_TXTIME |
| In the Linux kernel, the following vulnerability has been resolved:
f2fs: fix data loss caused by incorrect use of nat_entry flag
Data loss can occur when fsync is performed on a newly created file
(before any checkpoint has been written) concurrently with a checkpoint
operation. The scenario is as follows:
create & write & fsync 'file A' write checkpoint
- f2fs_do_sync_file // inline inode
- f2fs_write_inode // inode folio is dirty
- f2fs_write_checkpoint
- f2fs_flush_merged_writes
- f2fs_sync_node_pages
- f2fs_flush_nat_entries
- f2fs_fsync_node_pages // no dirty node
- f2fs_need_inode_block_update // return false
SPO and lost 'file A'
f2fs_flush_nat_entries() sets the IS_CHECKPOINTED and HAS_LAST_FSYNC
flags for the nat_entry, but this does not mean that the checkpoint has
actually completed successfully. However, f2fs_need_inode_block_update()
checks these flags and incorrectly assumes that the checkpoint has
finished.
The root cause is that the semantics of IS_CHECKPOINTED and
HAS_LAST_FSYNC are only guaranteed after the checkpoint write fully
completes.
This patch modifies f2fs_need_inode_block_update() to acquire the
sbi->node_write lock before reading the nat_entry flags, ensuring that
once IS_CHECKPOINTED and HAS_LAST_FSYNC are observed to be set, the
checkpoint operation has already completed. |
| Budibase is an open-source low-code platform. Prior to 3.39.9, authenticated users with automation permissions can bypass Budibase's SSRF blacklist through DNS rebinding. The outbound fetch flow validates a hostname against the blacklist before the request is sent, but the actual socket connection later performs a separate DNS lookup through node-fetch. Since the validated IPs are never pinned to the connection, an attacker-controlled hostname can return a public IP during validation and a private/internal IP during the real connection. This results in a non-blind SSRF primitive against internal services reachable from the Budibase host, including loopback, RFC1918 ranges, and cloud metadata endpoints. This vulnerability is fixed in 3.39.9. |
| In the Linux kernel, the following vulnerability has been resolved:
powerpc/64s: Fix unmap race with PMD migration entries
The following race is possible with migration swap entries or
device-private THP entries. e.g. when move_pages is called on a PMD THP
page, then there maybe an intermediate state, where PMD entry acts as
a migration swap entry (pmd_present() is true). Then if an munmap
happens at the same time, then this VM_BUG_ON() can happen in
pmdp_huge_get_and_clear_full().
This patch fixes that.
Thread A: move_pages() syscall
add_folio_for_migration()
mmap_read_lock(mm)
folio_isolate_lru(folio)
mmap_read_unlock(mm)
do_move_pages_to_node()
migrate_pages()
try_to_migrate_one()
spin_lock(ptl)
set_pmd_migration_entry()
pmdp_invalidate() # PMD: _PAGE_INVALID | _PAGE_PTE | pfn
set_pmd_at() # PMD: migration swap entry (pmd_present=0)
spin_unlock(ptl)
[page copy phase] # <--- RACE WINDOW -->
Thread B: munmap()
mmap_write_downgrade(mm)
unmap_vmas() -> zap_pmd_range()
zap_huge_pmd()
__pmd_trans_huge_lock()
pmd_is_huge(): # !pmd_present && !pmd_none -> TRUE (swap entry)
pmd_lock() -> # spin_lock(ptl), waits for Thread A to release ptl
pmdp_huge_get_and_clear_full()
VM_BUG_ON(!pmd_present(*pmdp)) # HITS!
[ 287.738700][ T1867] ------------[ cut here ]------------
[ 287.743843][ T1867] kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:187!
cpu 0x0: Vector: 700 (Program Check) at [c00000044037f4f0]
pc: c000000000094ca4: pmdp_huge_get_and_clear_full+0x6c/0x23c
lr: c000000000645dec: zap_huge_pmd+0xb0/0x868
sp: c00000044037f790
msr: 800000000282b033
current = 0xc0000004032c1a00
paca = 0xc000000004fe0000 irqmask: 0x03 irq_happened: 0x09
pid = 1867, comm = a.out
kernel BUG at :187!
Linux version 6.19.0-12136-g14360d4f917c-dirty (powerpc64le-linux-gnu-gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #27 SMP PREEMPT Sun Feb 22 10:38:56 IST 2026
enter ? for help
[link register ] c000000000645dec zap_huge_pmd+0xb0/0x868
[c00000044037f790] c00000044037f7d0 (unreliable)
[c00000044037f7d0] c000000000645dcc zap_huge_pmd+0x90/0x868
[c00000044037f840] c0000000005724cc unmap_page_range+0x176c/0x1f40
[c00000044037fa00] c000000000572ea0 unmap_vmas+0xb0/0x1d8
[c00000044037fa90] c0000000005af254 unmap_region+0xb4/0x128
[c00000044037fb50] c0000000005af400 vms_complete_munmap_vmas+0x138/0x310
[c00000044037fbe0] c0000000005b0f1c do_vmi_align_munmap+0x1ec/0x238
[c00000044037fd30] c0000000005b3688 __vm_munmap+0x170/0x1f8
[c00000044037fdf0] c000000000587f74 sys_munmap+0x2c/0x40
[c00000044037fe10] c000000000032668 system_call_exception+0x128/0x350
[c00000044037fe50] c00000000000d05c system_call_vectored_common+0x15c/0x2ec
---- Exception: 3000 (System Call Vectored) at 0000000010064a2c
SP (7fff9b1ee9c0) is in userspace
0:mon> zh
commit a30b48bf1b24 ("mm/migrate_device: implement THP migration of zone device pages"),
enabled migration for device-private PMD entries. Hence this is one
other path where this warning could get trigger from.
------------[ cut here ]------------
WARNING: arch/powerpc/mm/book3s64/hash_pgtable.c:199 at hash__pmd_hugepage_update+0x48/0x284, CPU#3: hmm-tests/1905
Modules linked in: test_hmm
CPU: 3 UID: 0 PID: 1905 Comm: hmm-tests Tainted: G B W L N 7.0.0-rc1-01438-g7e2f0ee7581c #21 PREEMPT
Tainted: [B]=BAD_PAGE, [W]=WARN, [L]=SOFTLOCKUP, [N]=TEST
Hardware name: IBM pSeries (emulated by qemu) POWER10 (architected) 0x801200 0xf000006 of:SLOF,git-ee03ae pSeries
NIP [c000000000096b70] hash__pmd_hugepage_update+0x48/0x284
LR [c000000000096e7c] hash__pmdp_huge_get_and_clear+0xd0/0xd4
Call Trace:
[c000000604707670] [c000000004e102b8] 0xc000000004e102b8 (unreliable)
[c000000604707700] [c00000000064ec3c] set_pmd_migration_entry+0x414/0x498
[c000000604707760] [c00000000063e5a4] migrate_vma_col
---truncated--- |
| In the Linux kernel, the following vulnerability has been resolved:
drm/amdgpu: fix AMDGPU_INFO_READ_MMR_REG
There were multiple issues in that code.
First of all the order between the reset semaphore and the mm_lock was
wrong (e.g. copy_to_user) was called while holding the lock.
Then we allocated memory while holding the reset semaphore which is also
a pretty big bug and can deadlock.
Then we used down_read_trylock() instead of waiting for the reset to
finish.
(cherry picked from commit 361b6e6b303d4b691f6c5974d3eaab67ca6dd90e) |
| In the Linux kernel, the following vulnerability has been resolved:
f2fs: protect extension_list reading with sb_lock in f2fs_sbi_show()
In f2fs_sbi_show(), the extension_list, extension_count and
hot_ext_count are read without holding sbi->sb_lock. If a concurrent
sysfs store modifies the extension list via f2fs_update_extension_list(),
the show path may read inconsistent count and array contents, potentially
leading to out-of-bounds access or displaying stale data.
Fix this by holding sb_lock around the entire extension list read
and format operation. |
| Notepad++ is a free and open-source source code editor. Prior to 8.9.6.4, NppCommands.cpp checks the HMAC of the on-disk shortcuts.xml at the moment a user command fires (Time-of-Check). However, the command payload is taken from the in-memory _userCommands vector, which is populated at application startup and never re-synchronized with the on-disk file (Time-of-Use). Swapping shortcuts.xml between startup and command execution causes the HMAC check to validate a clean file while a malicious command runs. An attacker with write access to shortcuts.xml places a malicious version on disk before launch, then immediately restores the legitimate file. The HMAC check at execution time validates the restored legitimate file (check passes), while the malicious payload executes from memory. This vulnerability is fixed in 8.9.6.4. |
| Pi is a minimal terminal coding harness. From 0.74.0 until 0.78.1, Pi stored API keys and OAuth credentials in auth.json. A race condition in the file write path could briefly create or rewrite this file with permissions derived from the process umask before tightening the file to owner-only permissions. This vulnerability is fixed in 0.78.1. |
| The Iptanus File Upload WordPress plugin before 5.1.7 does not implement proper file handling when the duplicatepolicy setting is configured to "maintain both." Due to a Time-of-Check to Time-of-Use (TOCTOU) race condition between the file existence check and the actual file write operation, an authenticated attacker can overwrite files uploaded by other users. |
| An authentication
bypass security issue exists within FactoryTalk Historian Site Edition. By
continually sending requests to the login endpoint, an attacker may obtain a
valid authentication token. |
| In EmberZNet v9.0.2 and earlier, a malformed Level Control Move command can terminate the process through a divide-by-zero fault. This command must come from a device that has already joined the network. Only devices supporting the Level Control cluster may be impacted. |
| In EmberZNet v9.0.2 and earlier, a malformed Level Control Step command can terminate the process through a divide-by-zero fault. This command must come from a device that has already joined the network. Only devices supporting the Level Control cluster may be impacted. |
| In the Linux kernel, the following vulnerability has been resolved:
bpf, sockmap: Fix af_unix null-ptr-deref in proto update
unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
socket is properly set up, which would include having a defined peer. IOW,
there's a window when unix_stream_bpf_update_proto() can be called on
socket which still has unix_peer(sk) == NULL.
CPU0 bpf CPU1 connect
-------- ------------
WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
sock_map_sk_state_allowed(sk)
...
sk_pair = unix_peer(sk)
sock_hold(sk_pair)
sock_hold(newsk)
smp_mb__after_atomic()
unix_peer(sk) = newsk
BUG: kernel NULL pointer dereference, address: 0000000000000080
RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
Call Trace:
sock_map_link+0x564/0x8b0
sock_map_update_common+0x6e/0x340
sock_map_update_elem_sys+0x17d/0x240
__sys_bpf+0x26db/0x3250
__x64_sys_bpf+0x21/0x30
do_syscall_64+0x6b/0x3a0
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Initial idea was to move peer assignment _before_ the sk_state update[1],
but that involved an additional memory barrier, and changing the hot path
was rejected.
Then a NULL check during proto update in unix_stream_bpf_update_proto() was
considered[2], but the follow-up discussion[3] focused on the root cause,
i.e. sockmap update taking a wrong lock. Or, more specifically, missing
unix_state_lock()[4].
In the end it was concluded that teaching sockmap about the af_unix locking
would be unnecessarily complex[5].
Complexity aside, since BPF_PROG_TYPE_SCHED_CLS and BPF_PROG_TYPE_SCHED_ACT
are allowed to update sockmaps, sock_map_update_elem() taking the unix
lock, as it is currently implemented in unix_state_lock():
spin_lock(&unix_sk(s)->lock), would be problematic. unix_state_lock() taken
in a process context, followed by a softirq-context TC BPF program
attempting to take the same spinlock -- deadlock[6].
This way we circled back to the peer check idea[2].
[1]: https://lore.kernel.org/netdev/ba5c50aa-1df4-40c2-ab33-a72022c5a32e@rbox.co/
[2]: https://lore.kernel.org/netdev/20240610174906.32921-1-kuniyu@amazon.com/
[3]: https://lore.kernel.org/netdev/7603c0e6-cd5b-452b-b710-73b64bd9de26@linux.dev/
[4]: https://lore.kernel.org/netdev/CAAVpQUA+8GL_j63CaKb8hbxoL21izD58yr1NvhOhU=j+35+3og@mail.gmail.com/
[5]: https://lore.kernel.org/bpf/CAAVpQUAHijOMext28Gi10dSLuMzGYh+jK61Ujn+fZ-wvcODR2A@mail.gmail.com/
[6]: https://lore.kernel.org/bpf/dd043c69-4d03-46fe-8325-8f97101435cf@linux.dev/
Summary of scenarios where af_unix/stream connect() may race a sockmap
update:
1. connect() vs. bpf(BPF_MAP_UPDATE_ELEM), i.e. sock_map_update_elem_sys()
Implemented NULL check is sufficient. Once assigned, socket peer won't
be released until socket fd is released. And that's not an issue because
sock_map_update_elem_sys() bumps fd refcnf.
2. connect() vs BPF program doing update
Update restricted per verifier.c:may_update_sockmap() to
BPF_PROG_TYPE_TRACING/BPF_TRACE_ITER
BPF_PROG_TYPE_SOCK_OPS (bpf_sock_map_update() only)
BPF_PROG_TYPE_SOCKET_FILTER
BPF_PROG_TYPE_SCHED_CLS
BPF_PROG_TYPE_SCHED_ACT
BPF_PROG_TYPE_XDP
BPF_PROG_TYPE_SK_REUSEPORT
BPF_PROG_TYPE_FLOW_DISSECTOR
BPF_PROG_TYPE_SK_LOOKUP
Plus one more race to consider:
CPU0 bpf CPU1 connect
-------- ------------
WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
sock_map_sk_state_allowed(sk)
sock_hold(newsk)
smp_mb__after_atomic()
---truncated--- |
| In the Linux kernel, the following vulnerability has been resolved:
net: ibm: emac: Fix use-after-free during device removal
The driver was using devm_register_netdev() which causes unregister_netdev()
to be deferred until the devres cleanup phase, which runs after emac_remove()
returns. This creates a use-after-free window where:
1. emac_remove() is called, which tears down hardware (cancels work, detaches
modules, unregisters from MAL)
2. emac_remove() returns
3. devres cleanup runs and finally calls unregister_netdev()
During step 3, the network stack might still process packets, triggering
emac_irq(), emac_poll(), or other handlers that access now-freed hardware
resources (dev->emacp, dev->mal, etc.).
Fix this by replacing devm_register_netdev() with manual register_netdev()
and calling unregister_netdev() at the beginning of emac_remove(), before
any hardware teardown. This ensures the network device is fully stopped and
unregistered before hardware resources are released.
The change is safe because:
- dev->ndev is assigned very early in probe (before any error paths that
could bypass emac_remove)
- platform_set_drvdata() is only called after successful registration, so
emac_remove() only runs for fully registered devices
- unregister_netdev() is idempotent and safe to call on any registered device |