
Commit 4523c7b

blog edit (#1995)
Signed-off-by: Chris Abraham <[email protected]>
1 parent f6cf100 commit 4523c7b

File tree

1 file changed: +4 -4 lines changed


_posts/2025-04-23-pytorch-2-7.md (+4, -4)
@@ -41,13 +41,13 @@ This release is composed of 3262 commits from 457 contributors since PyTorch 2.6
 <tr>
 <td>
 </td>
-<td>FlexAttention LLM <span style="text-decoration:underline;">first token processing</span> on X86 CPUs
+<td>FlexAttention LLM <span style="text-decoration:underline;">first token processing</span> on x86 CPUs
 </td>
 </tr>
 <tr>
 <td>
 </td>
-<td>FlexAttention LLM <span style="text-decoration:underline;">throughput mode optimization</span> on X86 CPUs
+<td>FlexAttention LLM <span style="text-decoration:underline;">throughput mode optimization</span> on x86 CPUs
 </td>
 </tr>
 <tr>
@@ -135,9 +135,9 @@ For more information regarding Intel GPU support, please refer to [Getting Start
 See also the tutorials [here](https://pytorch.org/tutorials/prototype/inductor_windows.html) and [here](https://pytorch.org/tutorials/prototype/pt2e_quant_xpu_inductor.html).


-### [Prototype] FlexAttention LLM first token processing on X86 CPUs
+### [Prototype] FlexAttention LLM first token processing on x86 CPUs

-FlexAttention X86 CPU support was first introduced in PyTorch 2.6, offering optimized implementations — such as PageAttention, which is critical for LLM inference—via the TorchInductor C++ backend. In PyTorch 2.7, more attention variants for first token processing of LLMs are supported. With this feature, users can have a smoother experience running FlexAttention on x86 CPUs, replacing specific *scaled_dot_product_attention* operators with a unified FlexAttention API, and benefiting from general support and good performance when using torch.compile.
+FlexAttention x86 CPU support was first introduced in PyTorch 2.6, offering optimized implementations — such as PageAttention, which is critical for LLM inference—via the TorchInductor C++ backend. In PyTorch 2.7, more attention variants for first token processing of LLMs are supported. With this feature, users can have a smoother experience running FlexAttention on x86 CPUs, replacing specific *scaled_dot_product_attention* operators with a unified FlexAttention API, and benefiting from general support and good performance when using torch.compile.


 ### [Prototype] FlexAttention LLM throughput mode optimization

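The paragraph edited in the second hunk describes replacing *scaled_dot_product_attention* calls with the unified FlexAttention API and compiling it with torch.compile so that TorchInductor's C++ backend generates the CPU kernels. Below is a minimal sketch of that usage pattern; it is not part of the commit or the blog post. The tensor shapes and the causal score_mod are illustrative assumptions, and it assumes a PyTorch build that ships torch.nn.attention.flex_attention (2.6 or later, with the x86 CPU improvements landing in 2.7).

```python
# Minimal sketch (not from the commit): unified FlexAttention API under
# torch.compile, running on an x86 CPU. Shapes and the causal score_mod
# are illustrative assumptions.
import torch
from torch.nn.attention.flex_attention import flex_attention

def causal(score, b, h, q_idx, kv_idx):
    # score_mod callback: keep the score where the query position may attend
    # to the key position, otherwise push it to -inf before the softmax.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

B, H, S, D = 1, 8, 1024, 64  # batch, heads, prompt length, head dim (illustrative)
q = torch.randn(B, H, S, D)
k = torch.randn(B, H, S, D)
v = torch.randn(B, H, S, D)

# Compile once; on CPU tensors, TorchInductor's C++ backend produces the kernels.
compiled_flex_attention = torch.compile(flex_attention)
out = compiled_flex_attention(q, k, v, score_mod=causal)
print(out.shape)  # torch.Size([1, 8, 1024, 64])
```

The same compiled call can replace a torch.nn.functional.scaled_dot_product_attention invocation in a prefill (first token processing) path, with the attention variant expressed through score_mod or a block mask rather than a separate fused operator.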
0 commit comments
