[SPARK-48710][PYTHON] Use NumPy 2.0 compatible types #47083
Conversation
cc @itholic
```
@@ -176,7 +176,7 @@ def as_spark_type(
        return None
    return types.ArrayType(element_type)
    # BinaryType
    elif tpe in (bytes, np.character, np.bytes_, np.string_):
```
qq: why do we remove `np.string_`?
Oh, nvm. I just checked the PR description.
Let's use the default PR template:
@itholic thanks for reviewing! Maybe it makes sense to update the GitHub jobs to test with both the lowest supported (i.e. `1.21`) and the latest NumPy version?
Oh, okay, seems like NumPy upgraded its major version recently (2024-06-17): Release Note. @HyukjinKwon maybe we should upgrade the minimum NumPy support to 2.0.0 as we did for Pandas? Also cc @zhengruifeng, who has worked on a similar PR: #42944.
I think that's too aggressive. NumPy is also used in Spark ML and many other dependent projects.
Seems fine from a cursory look, but let's make the CI happy :-).
CI passed!
There are some linter failures: https://github.com/codesorcery/spark/actions/runs/9708335267/job/26795030107
Have we tested these changes against NumPy 2.0?
```
@@ -5370,6 +5370,17 @@ def _test() -> None:
    import tempfile
    from pyspark.core.context import SparkContext

    try:
```
Shall we add a TODO that once we upgrade the minimum version to >= 2.0, we can remove this try-except and update the doctests?
There are also multiple existing tests with:

```
try:
    # Numpy 1.14+ changed its string format.
    numpy.set_printoptions(legacy="1.13")
except TypeError:
    pass
```

I'd guess these should be considered for updating before that (since the minimum NumPy version is currently 1.21).
```
    os.chdir(os.environ["SPARK_HOME"])

    if Version(np.__version__) >= Version("2"):
```
ditto
Added.
I've tested it on my local workstation, as written in the PR description. There aren't any CI jobs testing with NumPy 2.0 yet.
The change seems fine to me. @codesorcery, do you mind creating a PR to set the upper bound at `numpy<2`?
@HyukjinKwon you mean for branches where this PR doesn't get applied? Otherwise, most Python package managers and tools like Renovate won't allow users to update to NumPy 2 with this bound set. Maybe also of interest: there is a list tracking the compatibility status of Python libraries with NumPy 2.0 at numpy/numpy#26191
Yes.
@HyukjinKwon here's the PR for the `numpy<2` upper bound: #47175
It should be possible to write code that is compatible with NumPy 1 & 2; that is what most projects are doing. I would look over the migration guide; there are more suggestions in the release notes, as already noted. cc @rgommers (for awareness)
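For illustration, a minimal sketch of such dual-version code, gating 2.0-only behavior at runtime (assumes `packaging` is available, as in the snippets quoted elsewhere in this thread; not code from this PR):

```python
# Hedged sketch: the same code runs on NumPy 1.x and 2.x by checking
# the installed version before using 2.0-only features.
from packaging.version import Version

import numpy as np

if Version(np.__version__) >= Version("2"):
    # `legacy="1.25"` restores the pre-2.0 scalar formatting;
    # NumPy 1.x rejects this value, so it is only set on 2.x.
    np.set_printoptions(legacy="1.25")
```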
…1.15,<2)

### What changes were proposed in this pull request?

* Add a constraint for `numpy<2` to the PySpark package

### Why are the changes needed?

PySpark references some code which was removed with NumPy 2.0. Thus, if `numpy>=2` is installed, executing PySpark may fail. #47083 updates the `master` branch to be compatible with NumPy 2. This PR adds a version bound for older releases, where it won't be applied.

### Does this PR introduce _any_ user-facing change?

NumPy will be limited to `numpy<2` when installing `pyspark` with extras `ml`, `mllib`, `sql`, `pandas_on_spark` or `connect`.

### How was this patch tested?

Via existing CI jobs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47175 from codesorcery/SPARK-48710-numpy-upper-bound.

Authored-by: Patrick Marx <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Merged to master.
@codesorcery @HyukjinKwon I noticed that a review comment here said
@rgommers
Ah, there are tests within source files; I missed that. Sorry for the noise!
Hi, @codesorcery, @HyukjinKwon, @zhengruifeng, @itholic.

This PR seems to accidentally introduce a `numpy` dependency to the `core/rdd` module.

```
Starting test(python3): pyspark.core.rdd (temp output: /Users/dongjoon/APACHE/spark-merge/python/target/6da9b910-0500-479c-85ef-89e4bd085853/python3__pyspark.core.rdd__oldy8rob.log)
<frozen runpy>:128: RuntimeWarning: 'pyspark.core.rdd' found in sys.modules after import of package 'pyspark.core', but prior to execution of 'pyspark.core.rdd'; this may result in unpredictable behaviour
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/core/rdd.py", line 5400, in <module>
    _test()
    ~~~~~^^
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/core/rdd.py", line 5376, in _test
    import numpy as np
ModuleNotFoundError: No module named 'numpy'
```

The `core` module should not have this dependency, even in the test case. IIUC, this problem was already pointed out by @rgommers one month ago in this PR.
```
try:
    # Numpy 2.0+ changed its string format,
    # adding type information to numeric scalars.
    import numpy as np
```
Here we actually did a try-catch, but I think there's some issue related to the import.
Let's move this to `sql` and `ml` only for now because both modules use `numpy`.
Thank you. Yes, the problem is that the `try`-`except` didn't handle `ModuleNotFoundError`, causing failures like the following:

```
$ python/run-tests.py --python-executables python3 --modules pyspark-core
...
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/core/rdd.py", line 5376, in _test
    import numpy as np
ModuleNotFoundError: No module named 'numpy'
```
> Let's move this to `sql` and `ml` only for now because both modules use `numpy`.

+1 for moving.
```
    if Version(np.__version__) >= Version("2"):
        # `legacy="1.25"` is only available in `numpy>=2`
        np.set_printoptions(legacy="1.25")  # type: ignore[arg-type]
except TypeError:
```
Ah, yeah, let's catch `ImportError` too ...
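For reference, a minimal sketch of the guard being discussed, assuming it lives in the module's `_test()` doctest runner (illustrative, not the exact committed fix):

```python
# Catch ImportError as well, so `pyspark-core` tests still run when
# NumPy (an optional dependency for the core module) is not installed.
try:
    from packaging.version import Version

    import numpy as np

    if Version(np.__version__) >= Version("2"):
        # `legacy="1.25"` is only available in numpy>=2
        np.set_printoptions(legacy="1.25")  # type: ignore[arg-type]
except (ImportError, TypeError):
    # ImportError: NumPy is not installed.
    # TypeError: the installed NumPy rejects this `legacy` value.
    pass
```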
Here is a follow-up according to @HyukjinKwon's comment:
…ptional dependencies

### What changes were proposed in this pull request?

This is a follow-up of #47083 to recover PySpark RDD tests.

### Why are the changes needed?

`PySpark Core` test should not fail on optional dependencies.

**BEFORE**
```
$ python/run-tests.py --python-executables python3 --modules pyspark-core
...
  File "/Users/dongjoon/APACHE/spark-merge/python/pyspark/core/rdd.py", line 5376, in _test
    import numpy as np
ModuleNotFoundError: No module named 'numpy'
```

**AFTER**
```
$ python/run-tests.py --python-executables python3 --modules pyspark-core
...
Tests passed in 189 seconds

Skipped tests in pyspark.tests.test_memory_profiler with python3:
    test_assert_vanilla_mode (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_assert_vanilla_mode) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_aggregate_in_pandas (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_aggregate_in_pandas) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_clear (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_clear) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_cogroup_apply_in_arrow (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_cogroup_apply_in_arrow) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_cogroup_apply_in_pandas (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_cogroup_apply_in_pandas) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_group_apply_in_arrow (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_group_apply_in_arrow) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_group_apply_in_pandas (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_group_apply_in_pandas) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_map_in_pandas_not_supported (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_map_in_pandas_not_supported) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_pandas_udf (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_pandas_udf) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_pandas_udf_iterator_not_supported (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_pandas_udf_iterator_not_supported) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_pandas_udf_window (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_pandas_udf_window) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_udf (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_udf) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_udf_multiple_actions (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_udf_multiple_actions) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_udf_registered (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_udf_registered) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler_udf_with_arrow (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_memory_profiler_udf_with_arrow) ... skipped 'Must have memory-profiler installed.'
    test_profilers_clear (pyspark.tests.test_memory_profiler.MemoryProfiler2Tests.test_profilers_clear) ... skipped 'Must have memory-profiler installed.'
    test_code_map (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_code_map) ... skipped 'Must have memory-profiler installed.'
    test_memory_profiler (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_memory_profiler) ... skipped 'Must have memory-profiler installed.'
    test_profile_pandas_function_api (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_profile_pandas_function_api) ... skipped 'Must have memory-profiler installed.'
    test_profile_pandas_udf (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_profile_pandas_udf) ... skipped 'Must have memory-profiler installed.'
    test_udf_line_profiler (pyspark.tests.test_memory_profiler.MemoryProfilerTests.test_udf_line_profiler) ... skipped 'Must have memory-profiler installed.'

Skipped tests in pyspark.tests.test_rdd with python3:
    test_take_on_jrdd_with_large_rows_should_not_cause_deadlock (pyspark.tests.test_rdd.RDDTests.test_take_on_jrdd_with_large_rows_should_not_cause_deadlock) ... skipped 'NumPy or Pandas not installed'

Skipped tests in pyspark.tests.test_serializers with python3:
    test_statcounter_array (pyspark.tests.test_serializers.NumPyTests.test_statcounter_array) ... skipped 'NumPy not installed'
    test_serialize (pyspark.tests.test_serializers.SciPyTests.test_serialize) ... skipped 'SciPy not installed'

Skipped tests in pyspark.tests.test_worker with python3:
    test_memory_limit (pyspark.tests.test_worker.WorkerMemoryTest.test_memory_limit) ... skipped "Memory limit feature in Python worker is dependent on Python's 'resource' module on Linux; however, not found or not on Linux."
    test_python_segfault (pyspark.tests.test_worker.WorkerSegfaultNonDaemonTest.test_python_segfault) ... skipped 'SPARK-46130: Flaky with Python 3.12'
    test_python_segfault (pyspark.tests.test_worker.WorkerSegfaultTest.test_python_segfault) ... skipped 'SPARK-46130: Flaky with Python 3.12'
```

### Does this PR introduce _any_ user-facing change?

No. The failure happens during testing.

### How was this patch tested?

Pass the CIs and do the manual test without optional dependencies.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47526 from dongjoon-hyun/SPARK-48710.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?

* Replace NumPy types removed in NumPy 2.0 with their equivalent counterparts
* Make tests compatible with the new `__repr__` of numerical scalars

### Why are the changes needed?

PySpark references some code which was removed with NumPy 2.0 (see the sketch after this list):

* `np.NaN` was removed; it should be replaced with `np.nan`
* `np.string_` was removed; it [is an alias for](https://github.com/numpy/numpy/blob/v1.26.5/numpy/__init__.pyi#L3134) `np.bytes_`
* `np.float_` was removed; it [is defined the same as](https://github.com/numpy/numpy/blob/v1.26.5/numpy/__init__.pyi#L3042-3043) `np.double`
* `np.unicode_` was removed; it [is an alias for](https://github.com/numpy/numpy/blob/v1.26.5/numpy/__init__.pyi#L3148) `np.str_`
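A quick, illustrative mapping of the replacements listed above (assumes only that `numpy` is installed; these names exist on both 1.x and 2.x):

```python
import numpy as np

np.nan     # replaces the removed np.NaN
np.bytes_  # replaces the removed np.string_
np.double  # replaces the removed np.float_
np.str_    # replaces the removed np.unicode_
```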
NumPy 2.0 changed the `__repr__` of numerical scalars to contain type information (e.g. `np.int32(3)` instead of `3`). The old behavior can be enabled by setting `numpy.printoptions(legacy="1.25")` (or the older `1.21` and `1.13` legacy modes). There are multiple tests and doctests that rely on the old behavior.
### Does this PR introduce _any_ user-facing change?

No.
### How was this patch tested?

Tests for modules `pyspark-connect`, `pyspark-core`, `pyspark-errors`, `pyspark-mllib`, `pyspark-pandas`, `pyspark-sql`, `pyspark-resource`, `pyspark-testing` were executed in a local venv with `numpy==2.0.0` installed.

### Was this patch authored or co-authored using generative AI tooling?
No.