deps: Update dependency pytorch to >=2.7.0 #299
This PR contains the following updates:

| Package | Change |
| --- | --- |
| pytorch | `>=2.6.0` -> `>=2.7.0` |
Release Notes
pytorch/pytorch (pytorch)
v2.7.0: PyTorch 2.7.0 Release
PyTorch 2.7.0 Release Notes
Highlights
For more details about these highlighted features, you can look at the release blogpost.
Below are the full release notes for this release.
Tracked Regressions
NCCL init hits CUDA failure 'invalid argument' on 12.2 driver
Some users with a 12.2 CUDA driver (version 535) report seeing "CUDA driver error: invalid argument" during NCCL or Symmetric Memory initialization. This issue is currently under investigation, see #150852. If you build PyTorch from source, a known workaround is to rebuild PyTorch with the CUDA 12.2 toolkit. Otherwise, you can try upgrading the CUDA driver on your system.
Backwards Incompatible Changes
Dropped support for Triton < 2.2.0. Removed support for CUDA 12.4 and Anaconda in CI/CD.
C++ Extensions
`py_limited_api=True` is now built with `-DPy_LIMITED_API` (#145764)
We formally began respecting the `py_limited_api=True` kwarg in 2.6 and stopped linking `libtorch_python.so` when the flag was specified, as `libtorch_python.so` does not guarantee using APIs from the stable Python limited API. In 2.7, we go further by specifying the `-DPy_LIMITED_API` flag, which enforces that the extension is buildable with the limited API. As a result of this enforcement, custom extensions that set `py_limited_api=True` but do not abide by the limited API may fail to build; for an example, see #152243. This is strictly better behavior, as it is sketchy to claim CPython agnosticism without enforcing it with the flag. If you run into this issue, please ensure that the extension you are building does not use any APIs which are outside of the Python limited API, e.g., pybind.
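A hedged illustration of the build-side flag in a setup script; the extension name, the `ext.cpp` source file, and the `cp39` limited-API wheel tag are placeholders, not part of the release notes:

```python
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name="my_limited_api_ext",  # placeholder project name
    ext_modules=[
        CppExtension(
            "my_limited_api_ext",
            ["ext.cpp"],  # placeholder C++ source that only uses the limited API
            py_limited_api=True,  # 2.7 now also passes -DPy_LIMITED_API for this flag
        )
    ],
    cmdclass={"build_ext": BuildExtension},
    # Tag the resulting wheel as limited-API (abi3); the minimum CPython tag is a choice.
    options={"bdist_wheel": {"py_limited_api": "cp39"}},
)
```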
Change `torch.Tensor.new_tensor()` to be on the given Tensor's device by default (#144958)
This function was always creating the new Tensor on the "cpu" device and will now use the same device as the current Tensor object. This behavior is now consistent with other `.new_*` methods; a short sketch follows.
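A minimal sketch of the behavior change (guarded so it also runs on CPU-only machines):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
src = torch.ones(2, 2, device=device)

# 2.7: the new tensor is created on src.device; earlier releases always used cpu.
t = src.new_tensor([1.0, 2.0])
print(t.device)
```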
Use Manylinux 2.28 and CXX11_ABI=1 for future released Linux wheel builds.
With the migration to manylinux_2_28 (AlmaLinux 8 based), we can no longer support OS distros with glibc 2.26. These include the popular Amazon Linux 2 and CentOS 7. (#143423, #146200, #148028, #148135, #148195, #148129)
`torch.onnx.dynamo_export` now uses the ExportedProgram logic path (#137296)
Users of the `torch.onnx.dynamo_export` API may see some `ExportOptions` become unsupported due to an internal switch to use `torch.onnx.export(..., dynamo=True)`: `diagnostic_options`, `fake_context` and `onnx_registry` are removed/ignored by `ExportOptions`. Only `dynamic_shapes` is retained.
Users should move to the `dynamo=True` option on `torch.onnx.export`, as `torch.onnx.dynamo_export` is now deprecated. Leverage the `dynamic_shapes` argument in `torch.onnx.export` for specifying dynamic shapes on the model; a migration sketch follows the version comparison below.
Version 2.6.0
Version 2.7.0
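A hedged migration sketch, assuming the ONNX export dependencies (e.g. onnxscript) are installed; the tiny model, the input name, and the dynamic dimension are illustrative:

```python
import torch

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return x * 2.0

model, x = TinyModel(), torch.randn(2, 3)

# 2.6 style (now deprecated):
#   onnx_program = torch.onnx.dynamo_export(model, x)

# 2.7 style: dynamo=True plus the dynamic_shapes argument on torch.onnx.export.
batch = torch.export.Dim("batch")
onnx_program = torch.onnx.export(
    model,
    (x,),
    dynamo=True,
    dynamic_shapes={"x": {0: batch}},
)
onnx_program.save("tiny_model.onnx")
```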
Finish deprecation of `LRScheduler.print_lr()` along with the `verbose` kwarg to the LRScheduler constructor (#147301)
Both APIs have been deprecated since 2.2. Please use `LRScheduler.get_last_lr()` to access the learning rate instead. `print_lr` and `verbose` were confusing, not properly documented and little used, as described in #99270, so we deprecated them in 2.2. Now we complete the deprecation by removing them completely. To access and print the learning rate of an LRScheduler, see the sketch below.
Version 2.6.0
Version 2.7.0
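A minimal sketch of the replacement pattern (the optimizer and scheduler choices are arbitrary):

```python
import torch

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# verbose=True is no longer accepted by LRScheduler constructors in 2.7.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for _ in range(3):
    optimizer.step()
    scheduler.step()
    # Replacement for the removed print_lr(): read the last computed LR(s).
    print(scheduler.get_last_lr())
```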
libtorch_python.so symbols are now invisible by default on all platforms except Apple (#142214)
Previously, the symbols in libtorch_python.so were exposed with default visibility. We have transitioned to being more intentional about what we expose as public symbols for our Python API in C++. After #142214, public symbols will be marked explicitly while everything else will be hidden. Some extensions using private symbols will see linker failures with this change.
Please use `torch.export.export` instead of `capture_pre_autograd_graph` to export the model for PyTorch 2 Export Quantization (#139505)
`capture_pre_autograd_graph` was a temporary API in `torch.export`. Now that we have a better, longer-term API, `export`, available, we can deprecate it; see the sketch below.
Version 2.6.0
Version 2.7.0
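A rough sketch of the 2.7 export step for PT2E quantization; the tiny model is illustrative, and the quantizer wiring is only shown in comments since a backend-specific quantizer instance is required:

```python
import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 8),)

# 2.6: gm = capture_pre_autograd_graph(model, example_inputs)
# 2.7: export with torch.export.export and take the underlying GraphModule.
gm = torch.export.export(model, example_inputs).module()

# With a backend quantizer instance, the usual PT2E flow would then be:
#   prepared = prepare_pt2e(gm, quantizer)
#   ... run calibration data through prepared ...
#   quantized = convert_pt2e(prepared)
```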
New interface for `torch.fx.passes.graph_transform_observer.GraphTransformObserver` to enable Node Level provenance tracking (#144277)
We now track a mapping between the nodes in the pre-grad and post-grad graph. See the issue for an example frontend to visualize the transformations. To update your `GraphTransformObserver` subclasses, instead of overriding `on_node_creation` and `on_node_erase`, there are new functions `get_node_creation_hook`, `get_node_erase_hook`, `get_node_replace_hook` and `get_deepcopy_hook`. These are registered on the `GraphModule` member of the `GraphTransformObserver` upon entry and exit of a `with` block.
Version 2.6.0
Version 2.7.0
`torch.ao.quantization.pt2e.graph_utils.get_control_flow_submodules` is no longer public (#141612)
We are planning to make all functions under `torch.ao.quantization.pt2e.graph_utils` private. This update marks `get_control_flow_submodules` as a private API. If you have to or want to continue using `get_control_flow_submodules`, please make a private call by using `_get_control_flow_submodules`.
Example:
Version 2.6:
Version 2.7:
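A small sketch of the rename, assuming the private function keeps the old signature (a `torch.fx.GraphModule` in, a list of control-flow submodules out):

```python
import torch
import torch.fx
# 2.6: from torch.ao.quantization.pt2e.graph_utils import get_control_flow_submodules
# 2.7: only the underscored private name remains.
from torch.ao.quantization.pt2e.graph_utils import _get_control_flow_submodules

gm = torch.fx.symbolic_trace(torch.nn.ReLU())
# Returns an empty list for a module without cond/map-style control flow.
print(_get_control_flow_submodules(gm))
```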
Deprecations
`torch.onnx.dynamo_export` is deprecated (#146425, #146639, #146923)
Users should use the `dynamo=True` option on `torch.onnx.export`.
Version 2.6.0
Version 2.7.0
`XNNPACKQuantizer` is deprecated in PyTorch and moved to ExecuTorch; please use it from `executorch.backends.xnnpack.quantizer.xnnpack_quantizer` instead of `torch.ao.quantization.quantizer.xnnpack_quantizer`. (#144940)
`XNNPACKQuantizer` is a quantizer for XNNPACK that was added into pytorch/pytorch for initial development. However, as it is not related to our core quantization workflow, we have moved it to ExecuTorch instead. Please use it from `executorch.backends.xnnpack.quantizer.xnnpack_quantizer` instead of `torch.ao.quantization.quantizer.xnnpack_quantizer`; see the sketch below.
Version 2.6.0
Version 2.7.0
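A sketch of the import change; it assumes the `executorch` package is installed and that `get_symmetric_quantization_config` and `set_global` behave as they did in the old torch.ao location:

```python
# 2.6 (deprecated):
#   from torch.ao.quantization.quantizer.xnnpack_quantizer import (
#       XNNPACKQuantizer, get_symmetric_quantization_config)

# 2.7: import from ExecuTorch instead.
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

quantizer = XNNPACKQuantizer()
quantizer.set_global(get_symmetric_quantization_config())
```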
New features
Release Engineering
Python Frontend
- `torch.utils.serialization.config` namespace for all serialization related configurations (#143324)
- `torch.serialization.config.save.use_pinned_memory_for_d2h` to speed up `torch.save` when passed gpu devices (#143342)
- `torch.utils.serialization.config.load.calculate_storage_offsets` to reduce random reads and significantly improve performance for storage with bad random access performance (#143880)
- `__torch_function__` handler on dtype arguments, similar to subclass objects (#145085)
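A hedged sketch of the new serialization config namespace, using only the option names listed above:

```python
import torch
from torch.utils.serialization import config as serialization_config

# Speed up torch.save of GPU tensors and reduce random reads on load.
serialization_config.save.use_pinned_memory_for_d2h = True
serialization_config.load.calculate_storage_offsets = True

t = torch.randn(4)
torch.save(t, "t.pt")
print(torch.load("t.pt", weights_only=True))
```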
C++ Extensions
Distributed
Context Parallel
We added a Context Parallel API to parallelize `torch.nn.functional.scaled_dot_product_attention` over the sequence dimension. We implemented Ring Attention (#131351) and an AllGather-based approach (#132820) where the all-gather is issued before the first local SDPA and the subsequent local SDPAs have to wait until the all-gather completes, and offered a user API (#142093) to select the desired approach. The implementation currently supports three SDPA kernels: `SDPBackend.FLASH_ATTENTION`, `SDPBackend.EFFICIENT_ATTENTION`, and `SDPBackend.CUDNN_ATTENTION` (#148537). We also verified that our Context Parallel implementation is compatible with other parallelisms and `torch.compile`.
c10d
Distributed Checkpoint (DCP)
CUDA
- `torch.compile` (#145270)
- `torch.cuda.gds` APIs public (#147120)
MPS
ROCm
XPU
- `torch.compile` on Windows Platform for XPU (#147637, #144316, #149511)
- `torch.utils.cpp_extension` APIs (#132945)
torch.compile
Dynamo
- `contextlib.contextmanager` in Dynamo (#136033)
- `nonstrict_trace` escape hatch to apply non-strict tracing to difficult-to-compile code (#146367)
- `list` subclasses (#146819)
Inductor
- `head_dim` for FlexAttention (#133495).
- `num_warps` and `num_stages` (#139639).
- `ConfigFuzzer`: a new debugging tool designed to fuzz Torch compile configurations. Given a test function, it will identify combinations of configs that throw errors during compilation and execution (#139736) (#145565).
- `TORCHINDUCTOR_PROLOGUE_FUSION` enables this feature (#147008).
- `TORCHINDUCTOR_CUTLASS_INSTANTIATION_LEVEL`. Consult config.py for information (#146230).
- `cuda.cutlass_max_profiling_swizzle_options` (#146088).
- `package_cpp_only` is specified in AOTI (#143352).
- `graph_partition` functions. Set the `graph_partition` in inductor config to enable (#147038).
Profiler
- `experimentalConfig` (#143659)
Quantization
- `torch.ops.aten._dyn_quant_matmul_4bit`, while the weights, scales and optional bias are packed in `torch.ops.aten._dyn_quant_pack_4bit_weight`. To use it on your model you can quantize it using the following example that leverages torchao:
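A loosely related, hedged sketch of a torchao dynamic-quantization flow; whether it lowers to the `_dyn_quant_*` ops above depends on the chosen weight layout and the target backend (e.g. aarch64/KleidiAI), so treat the config choice as an assumption:

```python
import torch
from torchao.quantization import int8_dynamic_activation_int4_weight, quantize_

model = torch.nn.Sequential(torch.nn.Linear(128, 128)).eval()

# Dynamic int8-activation / int4-weight quantization applied in place by torchao.
quantize_(model, int8_dynamic_activation_int4_weight())

print(model(torch.randn(1, 128)).shape)
```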
ONNX
- `torch.onnx.verification.verify_onnx_program` (#148396, #148706, #148730, #148707)
A new verification API `torch.onnx.verification.verify_onnx_program` can now be used to verify numerical accuracy of the exported ONNX model. Users can use the `compare_intermediates` option to identify any operator that causes numerical discrepancies in intermediate tensors. It is possible to use a tool like model-explorer to visualize the verification results; see the sketch after this list.
- `dynamic_shapes` (#146321)
- `torch.onnx.export(dynamo=True)` now optimizes the output model by default (#146187)
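A hedged usage sketch of the verification API (assumes onnxruntime is available to run the exported model; the tiny model is illustrative):

```python
import torch

class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.gelu(x) + 1.0

onnx_program = torch.onnx.export(TinyModel(), (torch.randn(2, 4),), dynamo=True)

# Compares ONNX Runtime results against eager PyTorch; compare_intermediates=True
# also reports discrepancies in intermediate tensors.
infos = torch.onnx.verification.verify_onnx_program(
    onnx_program, compare_intermediates=True
)
for info in infos:
    print(info)
```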
Improvements
Release Engineering
Python Frontend
- `torch.addcmul` (#143264)
- `-DPy_LIMITED_API` flag for `py_limited_api=True` cpp_extensions (#145764)
- `torch.jit.load` (#143403)
- `torch.save` configurable (#147788)
- `with` statement on torch.Stream (#140138)
Autograd
- `torch.autograd.graph.GradientEdge` as `torch.autograd.backward` outputs (#144744)
- `residuals` of `torch.linalg.lstsq` (#148526)
- `reflection_pad2d_backward` (#136241)
Dataloader
- `in_order` is `False` (#142324)
- `device` argument. `device` and `pin_memory_device` are discouraged and will be deprecated in the future. (#131858)
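A small sketch of the two items above (dataset and batch size are arbitrary):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.arange(16, dtype=torch.float32))

# in_order=False lets worker results be yielded out of order; only pin_memory=True
# is passed, since the device/pin_memory_device arguments are discouraged.
dl = DataLoader(ds, batch_size=4, num_workers=2, in_order=False, pin_memory=True)

if __name__ == "__main__":  # guard needed for multiprocessing workers on some platforms
    for (batch,) in dl:
        print(batch)
```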
Linear Algebra
- `torch.cum{min,max}`. (#143920)
Nested Tensor (NJT)
- `chunk()` backward on batch dim (#144584)
- `*_like` factory functions for NJT (#144889)
- `matmul` with NJTs via backward support and composition with dense tensors (#144587, #146405)
torch.nn
- `strict` kwarg to `nn.Module.set_submodule` and fix bug for non dot-delineated strings (#143455)
- `reflection_pad1d`, `reflection_pad2d` and `reflection_pad3d` (#141670)
torch.optim
Build Frontend
C++ Frontend
- `isAcceleratorExcluded` (#144959)
Distributed
c10d
- `abort` and `shutdown` by adding both to `Backend` and `ProcessGroup` objects (#148798)
- `new_group` instead of `split_group` on non-CUDA device (#141469)
- `call_guard` in pybind object init of c10d (#143598)
- `getDefaultBackend` more fault tolerant (#148596)
DistributedDataParallel (DDP)
- `init_sync` option to control collectives during initialization (#142824)
FullyShardedDataParallel2 (FSDP2)
- `reduce_dtype` in lazy init (#143297)
DTensor
- `aten.amin/amax` to `linear_reduction_strategy` (#143747)
- `src_data_rank` to `distribute_tensor` API (#143883)
- `_scaled_mm` (#143760)
- `aten.view.dtype` op support (#144404)
- `shard_dim_alltoall` to use `alltoall_single` (#148868)
- `_shard_tensor` to use `src_data_rank=None` (#144171)
- `aten.minimum` (#145816)
TensorParallel
- `src_data_rank` kwarg in TP API (#144005)
Torch Elastic
- `etcd_rendezvous` publicly importable (#145396)
Pipelining
- `generate_stage_to_rank_mapping` utility (#146193)
- `stage_index_to_group_rank` from schedule (#146217)
CPU
General
x86
- `brgemm` (#143384)
CUDA
- `sharedMemPerMultiprocessor` device property to python (#143119)
- `cudaDeviceProps` to python (#143226)
- `index >= 0` of cuda device (#140791)
- `get_stream_from_external` API for CUDA backend (#143799)
MPS
- `angle`, `entr`, `spherical_bessel_j0`, `xlog1py`, `sinc`, `round.decimals`, `linalg.det`, `cholesky.ex`, `bilineard2d_aa`, `linalg.solve`, `zeta`, `cholesky`, `fused_rms_norm`, `lu_unpack`, `lu_factor_ex`, `slogdet` and `logdet` (#143449, #147948, #146818, #147687, #146539, #147266, #146279, #146799, #145526, #146531, #146465, #145701, #145301, #146681, #144651, #145341, #146771, #147914)
- `angle` and `atan2` for long type, `torch.special.sinc` to complex, `torch.mm`/`torch.bmm` to integral types (#149017, #146648, #145809, #147526)
- `torch.accelerator.synchronize()` on MPS (#143171)
- `gamma`, `zeta`, `sinc`, `spherical_bessel_j0`, `entr` (#145341, #146465, #146539, #147650, #148128)
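A quick sketch exercising a couple of the newly supported ops (guarded so it also runs on machines without an MPS device):

```python
import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"
x = torch.rand(4, device=device)

# entr and sinc are among the ops newly covered on MPS in 2.7.
print(torch.special.entr(x))
print(torch.sinc(x))
```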
ROCm
XPU
- `convolution_backward` output layout between fake tensor and real output tensor (#146880)
- `torch.xpu.get_device_properties` API error message (#144379)
- `nested_layer_norm` support for XPU (#148593)
- `is_big_gpu()` check in Inductor (#143491)
Profiler
torch.compile
Dynamo
- `dict` subclasses (#143548)
- `trace_rules.py` skipfiles (#145856)
- `transformers` `ModelOutput` (#143567)
AOTDispatcher
- `buffer.copy_(int)` (#141161)
- `torch.inference_mode` (#147925)
Dynamic Shapes
- `topk` (#147017)
- `interpolate(antialias=True)` backward (#141198)
- `nonzero_static` (#146006)
- `_compute_symbolic_stride()` (#138844)
- `torch._check` (#144471)
- `backed_size_oblivious` config (#148696)
- `mark_unbacked` strict mode (#147333, #147342)
Decompositions, FakeTensor and meta tensors
Several operator decomps received improvements/bugfixes:
- `torch._refs.tensor` (#143461)
- `torch._refs.mean` (#147188)
- `linspace` (#147997)
- `addmv` (#143792)
New meta tensor implementations for a few pytorch operators:
- `nonzero` (#144727)
- `silu`, `sigmoid`, `_softmax`, `embedding` (#147862)
New fake tensor implementation for a few pytorch operators:
- `unique_consecutive` (#145649)
Several general FakeTensor improvements
- `UntypedStorage.from_buffer(buf)` to return meta storage under FakeTensorMode (#146642)
- `meta_tensor.to(device='cpu')` under `fake_mode` (#146729)
Inductor
- `swizzle` (#147223).
- `INDUCTOR_CPP_ENABLE_FLOATING_POINT_CONTRACT_FLAG` will be passed to `ffp-contract` (#143450).
- `TORCHINDUCTOR_WORKER_START` to one of "subprocess", "fork", or "spawn" (#144491).
- `do_bench` (#133058).
- `emulate_precision_casts`: `TORCHINDUCTOR_EMULATE_PRECISION_CASTS` (#145948).
- `TORCHINDUCTOR_CUTLASS_ALLOWLIST` and `TORCHINDUCTOR_CUTLASS_DENYLIST` (#148161).
- `TORCHINDUCTOR_SCALAR_ASSERTS` (#146462).
- `layout_optimization` and `comprehensive_padding` (#148450).
- `AOT_INDUCTOR_COMPILE_WRAPPER_WITH_O0=1` (#144866).
- `global_scratch` arg, fix cpp_wrapper (#148051, #149973).
- `_int_mm` in AOTI (#144571).
- `torch.ops.aten._assert_tensor_metadata.default` for AOTI (#145028).
- `aot_compile` and `aoti_compile_and_package` (#148506).
- `_weight_int4pack_mm_cpu_tensor` (#149031)
torch.fx
torch.export
serialization
"+export"
logging to de/serialization process (#145283)builtins.getattr
with serializable higher-order-op foConfiguration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Enabled.
♻ Rebasing: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.