
Commit 6a3cbbc

holdenk authored and JoshRosen committed
[SPARK-1267][SPARK-18129] Allow PySpark to be pip installed
## What changes were proposed in this pull request?

This PR aims to provide a pip installable PySpark package. This does a bunch of work to copy the jars over and package them with the Python code (to prevent challenges from trying to use different versions of the Python code with different versions of the JAR). It does not currently publish to PyPI, but that is the natural follow up (SPARK-18129).

Done:
- pip installable on conda [manually tested]
- setup.py installed on a non-pip managed system (RHEL) with YARN [manually tested]
- Automated testing of this (virtualenv)
- packaging and signing with release-build*

Possible follow up work:
- release-build update to publish to PyPI (SPARK-18128)
- figure out who owns the pyspark package name on prod PyPI (is it someone within the project, or should we ask PyPI, or should we choose a different name to publish with, like ApachePySpark?)
- Windows support and/or testing (SPARK-18136)
- investigate details of wheel caching and see if we can avoid cleaning the wheel cache during our tests
- consider how we want to number our dev/snapshot versions

Explicitly out of scope:
- Using pip installed PySpark to start a standalone cluster
- Using pip installed PySpark for non-Python Spark programs

*I've done some work to test release-build locally, but as a non-committer I've only done local testing.

## How was this patch tested?

Automated testing with virtualenv, manual testing with conda, a system-wide install, and YARN integration. release-build changes were tested locally as a non-committer (no testing of uploading artifacts to Apache staging websites).

Author: Holden Karau <[email protected]>
Author: Juliet Hougland <[email protected]>
Author: Juliet Hougland <[email protected]>

Closes #15659 from holdenk/SPARK-1267-pip-install-pyspark.
1 parent 9515793 commit 6a3cbbc

31 files changed (+660 additions, -24 deletions)
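For context (not part of this diff), a minimal Python sketch of the end-user experience this change targets, assuming the sdist built under python/dist has already been pip installed into the active environment:

```python
# Hypothetical smoke test of a pip-installed PySpark; assumes something like
# `pip install python/dist/pyspark-2.1.0.dev0+hadoop2.7.tar.gz` has already
# been run. No SPARK_HOME needs to be exported: bin/find-spark-home and
# find_spark_home.py resolve it from the installed package.
import pyspark
from pyspark.sql import SparkSession

print(pyspark.__version__)  # version string written into pyspark/version.py

spark = SparkSession.builder.appName("pip-install-smoke-test").getOrCreate()
print(spark.range(10).count())  # prints 10 if the bundled jars were located
spark.stop()
```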

.gitignore

Lines changed: 2 additions & 0 deletions

@@ -57,6 +57,8 @@ project/plugins/project/build.properties
 project/plugins/src_managed/
 project/plugins/target/
 python/lib/pyspark.zip
+python/deps
+python/pyspark/python
 reports/
 scalastyle-on-compile.generated.xml
 scalastyle-output.xml

bin/beeline

Lines changed: 1 addition & 1 deletion

@@ -25,7 +25,7 @@ set -o posix

 # Figure out if SPARK_HOME is set
 if [ -z "${SPARK_HOME}" ]; then
-  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
+  source "$(dirname "$0")"/find-spark-home
 fi

 CLASS="org.apache.hive.beeline.BeeLine"

bin/find-spark-home

Lines changed: 41 additions & 0 deletions

@@ -0,0 +1,41 @@
+#!/usr/bin/env bash
+
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Attempts to find a proper value for SPARK_HOME. Should be included using "source" directive.
+
+FIND_SPARK_HOME_PYTHON_SCRIPT="$(cd "$(dirname "$0")"; pwd)/find_spark_home.py"
+
+# Short circuit if the user already has this set.
+if [ ! -z "${SPARK_HOME}" ]; then
+  exit 0
+elif [ ! -f "$FIND_SPARK_HOME_PYTHON_SCRIPT" ]; then
+  # If we are not in the same directory as find_spark_home.py we are not pip installed so we don't
+  # need to search the different Python directories for a Spark installation.
+  # Note also that, if the user has pip installed PySpark but is directly calling pyspark-shell or
+  # spark-submit in another directory we want to use that version of PySpark rather than the
+  # pip installed version of PySpark.
+  export SPARK_HOME="$(cd "$(dirname "$0")"/..; pwd)"
+else
+  # We are pip installed, use the Python script to resolve a reasonable SPARK_HOME
+  # Default to standard python interpreter unless told otherwise
+  if [[ -z "$PYSPARK_DRIVER_PYTHON" ]]; then
+    PYSPARK_DRIVER_PYTHON="${PYSPARK_PYTHON:-"python"}"
+  fi
+  export SPARK_HOME=$($PYSPARK_DRIVER_PYTHON "$FIND_SPARK_HOME_PYTHON_SCRIPT")
+fi
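The find_spark_home.py script that this wrapper invokes is added elsewhere in this commit and is not shown in this excerpt. A rough, hypothetical sketch of the resolution it performs (honor an existing SPARK_HOME, otherwise treat the directory of the installed pyspark package as Spark home) might look like:

```python
#!/usr/bin/env python
# Hypothetical sketch only; the real find_spark_home.py in this commit may
# differ in detail (e.g. extra candidate paths and error messages).
from __future__ import print_function
import os
import sys


def _find_spark_home():
    """Return a usable SPARK_HOME, preferring an explicit environment value."""
    if "SPARK_HOME" in os.environ:
        return os.environ["SPARK_HOME"]
    try:
        import pyspark  # resolves to the pip-installed package, if any
        return os.path.dirname(os.path.abspath(pyspark.__file__))
    except ImportError:
        print("Could not find valid SPARK_HOME", file=sys.stderr)
        sys.exit(-1)


if __name__ == "__main__":
    # bin/find-spark-home captures this stdout into SPARK_HOME
    print(_find_spark_home())
```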

bin/load-spark-env.sh

Lines changed: 1 addition & 1 deletion

@@ -23,7 +23,7 @@

 # Figure out where Spark is installed
 if [ -z "${SPARK_HOME}" ]; then
-  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
+  source "$(dirname "$0")"/find-spark-home
 fi

 if [ -z "$SPARK_ENV_LOADED" ]; then

bin/pyspark

Lines changed: 3 additions & 3 deletions

@@ -18,7 +18,7 @@
 #

 if [ -z "${SPARK_HOME}" ]; then
-  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
+  source "$(dirname "$0")"/find-spark-home
 fi

 source "${SPARK_HOME}"/bin/load-spark-env.sh

@@ -46,7 +46,7 @@ WORKS_WITH_IPYTHON=$(python -c 'import sys; print(sys.version_info >= (2, 7, 0))')

 # Determine the Python executable to use for the executors:
 if [[ -z "$PYSPARK_PYTHON" ]]; then
-  if [[ $PYSPARK_DRIVER_PYTHON == *ipython* && ! WORKS_WITH_IPYTHON ]]; then
+  if [[ $PYSPARK_DRIVER_PYTHON == *ipython* && ! $WORKS_WITH_IPYTHON ]]; then
    echo "IPython requires Python 2.7+; please install python2.7 or set PYSPARK_PYTHON" 1>&2
    exit 1
  else

@@ -68,7 +68,7 @@ if [[ -n "$SPARK_TESTING" ]]; then
   unset YARN_CONF_DIR
   unset HADOOP_CONF_DIR
   export PYTHONHASHSEED=0
-  exec "$PYSPARK_DRIVER_PYTHON" -m $1
+  exec "$PYSPARK_DRIVER_PYTHON" -m "$1"
   exit
 fi

bin/run-example

Lines changed: 1 addition & 1 deletion

@@ -18,7 +18,7 @@
 #

 if [ -z "${SPARK_HOME}" ]; then
-  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
+  source "$(dirname "$0")"/find-spark-home
 fi

 export _SPARK_CMD_USAGE="Usage: ./bin/run-example [options] example-class [example args]"

bin/spark-class

Lines changed: 3 additions & 3 deletions

@@ -18,7 +18,7 @@
 #

 if [ -z "${SPARK_HOME}" ]; then
-  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
+  source "$(dirname "$0")"/find-spark-home
 fi

 . "${SPARK_HOME}"/bin/load-spark-env.sh

@@ -27,7 +27,7 @@ fi
 if [ -n "${JAVA_HOME}" ]; then
   RUNNER="${JAVA_HOME}/bin/java"
 else
-  if [ `command -v java` ]; then
+  if [ "$(command -v java)" ]; then
     RUNNER="java"
   else
     echo "JAVA_HOME is not set" >&2

@@ -36,7 +36,7 @@ else
 fi

 # Find Spark jars.
-if [ -f "${SPARK_HOME}/RELEASE" ]; then
+if [ -d "${SPARK_HOME}/jars" ]; then
   SPARK_JARS_DIR="${SPARK_HOME}/jars"
 else
   SPARK_JARS_DIR="${SPARK_HOME}/assembly/target/scala-$SPARK_SCALA_VERSION/jars"

bin/spark-shell

Lines changed: 2 additions & 2 deletions

@@ -21,15 +21,15 @@
 # Shell script for starting the Spark Shell REPL

 cygwin=false
-case "`uname`" in
+case "$(uname)" in
   CYGWIN*) cygwin=true;;
 esac

 # Enter posix mode for bash
 set -o posix

 if [ -z "${SPARK_HOME}" ]; then
-  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
+  source "$(dirname "$0")"/find-spark-home
 fi

 export _SPARK_CMD_USAGE="Usage: ./bin/spark-shell [options]"

bin/spark-sql

Lines changed: 1 addition & 1 deletion

@@ -18,7 +18,7 @@
 #

 if [ -z "${SPARK_HOME}" ]; then
-  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
+  source "$(dirname "$0")"/find-spark-home
 fi

 export _SPARK_CMD_USAGE="Usage: ./bin/spark-sql [options] [cli option]"

bin/spark-submit

Lines changed: 1 addition & 1 deletion

@@ -18,7 +18,7 @@
 #

 if [ -z "${SPARK_HOME}" ]; then
-  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
+  source "$(dirname "$0")"/find-spark-home
 fi

 # disable randomized hash for string in Python 3.3+

bin/sparkR

Lines changed: 1 addition & 1 deletion

@@ -18,7 +18,7 @@
 #

 if [ -z "${SPARK_HOME}" ]; then
-  export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
+  source "$(dirname "$0")"/find-spark-home
 fi

 source "${SPARK_HOME}"/bin/load-spark-env.sh

dev/create-release/release-build.sh

Lines changed: 24 additions & 2 deletions

@@ -162,14 +162,35 @@ if [[ "$1" == "package" ]]; then
   export ZINC_PORT=$ZINC_PORT
   echo "Creating distribution: $NAME ($FLAGS)"

+  # Write out the NAME and VERSION to PySpark version info we rewrite the - into a . and SNAPSHOT
+  # to dev0 to be closer to PEP440. We use the NAME as a "local version".
+  PYSPARK_VERSION=`echo "$SPARK_VERSION+$NAME" | sed -r "s/-/./" | sed -r "s/SNAPSHOT/dev0/"`
+  echo "__version__='$PYSPARK_VERSION'" > python/pyspark/version.py
+
   # Get maven home set by MVN
   MVN_HOME=`$MVN -version 2>&1 | grep 'Maven home' | awk '{print $NF}'`

-  ./dev/make-distribution.sh --name $NAME --mvn $MVN_HOME/bin/mvn --tgz $FLAGS \
+  echo "Creating distribution"
+  ./dev/make-distribution.sh --name $NAME --mvn $MVN_HOME/bin/mvn --tgz --pip $FLAGS \
     -DzincPort=$ZINC_PORT 2>&1 > ../binary-release-$NAME.log
   cd ..
-  cp spark-$SPARK_VERSION-bin-$NAME/spark-$SPARK_VERSION-bin-$NAME.tgz .

+  echo "Copying and signing python distribution"
+  PYTHON_DIST_NAME=pyspark-$PYSPARK_VERSION.tar.gz
+  cp spark-$SPARK_VERSION-bin-$NAME/python/dist/$PYTHON_DIST_NAME .
+
+  echo $GPG_PASSPHRASE | $GPG --passphrase-fd 0 --armour \
+    --output $PYTHON_DIST_NAME.asc \
+    --detach-sig $PYTHON_DIST_NAME
+  echo $GPG_PASSPHRASE | $GPG --passphrase-fd 0 --print-md \
+    MD5 $PYTHON_DIST_NAME > \
+    $PYTHON_DIST_NAME.md5
+  echo $GPG_PASSPHRASE | $GPG --passphrase-fd 0 --print-md \
+    SHA512 $PYTHON_DIST_NAME > \
+    $PYTHON_DIST_NAME.sha
+
+  echo "Copying and signing regular binary distribution"
+  cp spark-$SPARK_VERSION-bin-$NAME/spark-$SPARK_VERSION-bin-$NAME.tgz .
   echo $GPG_PASSPHRASE | $GPG --passphrase-fd 0 --armour \
     --output spark-$SPARK_VERSION-bin-$NAME.tgz.asc \
     --detach-sig spark-$SPARK_VERSION-bin-$NAME.tgz

@@ -208,6 +229,7 @@ if [[ "$1" == "package" ]]; then
   # Re-upload a second time and leave the files in the timestamped upload directory:
   LFTP mkdir -p $dest_dir
   LFTP mput -O $dest_dir 'spark-*'
+  LFTP mput -O $dest_dir 'pyspark-*'
   exit 0
 fi
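As a sanity check on the sed pipeline above, the same rewrite expressed in Python (values are illustrative; the release script operates on $SPARK_VERSION and $NAME):

```python
# Mirror of: echo "$SPARK_VERSION+$NAME" | sed -r "s/-/./" | sed -r "s/SNAPSHOT/dev0/"
# The first "-" becomes ".", "SNAPSHOT" becomes "dev0", and NAME ends up as a
# PEP 440 style local version label.
def pyspark_version(spark_version, name):
    combined = "%s+%s" % (spark_version, name)
    return combined.replace("-", ".", 1).replace("SNAPSHOT", "dev0", 1)

print(pyspark_version("2.1.0-SNAPSHOT", "hadoop2.7"))  # -> 2.1.0.dev0+hadoop2.7
```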

dev/create-release/release-tag.sh

Lines changed: 8 additions & 3 deletions

@@ -65,6 +65,7 @@ sed -i".tmp1" 's/Version.*$/Version: '"$RELEASE_VERSION"'/g' R/pkg/DESCRIPTION
 # Set the release version in docs
 sed -i".tmp1" 's/SPARK_VERSION:.*$/SPARK_VERSION: '"$RELEASE_VERSION"'/g' docs/_config.yml
 sed -i".tmp2" 's/SPARK_VERSION_SHORT:.*$/SPARK_VERSION_SHORT: '"$RELEASE_VERSION"'/g' docs/_config.yml
+sed -i".tmp3" 's/__version__ = .*$/__version__ = "'"$RELEASE_VERSION"'"/' python/pyspark/version.py

 git commit -a -m "Preparing Spark release $RELEASE_TAG"
 echo "Creating tag $RELEASE_TAG at the head of $GIT_BRANCH"

@@ -74,12 +75,16 @@ git tag $RELEASE_TAG
 $MVN versions:set -DnewVersion=$NEXT_VERSION | grep -v "no value" # silence logs
 # Remove -SNAPSHOT before setting the R version as R expects version strings to only have numbers
 R_NEXT_VERSION=`echo $NEXT_VERSION | sed 's/-SNAPSHOT//g'`
-sed -i".tmp2" 's/Version.*$/Version: '"$R_NEXT_VERSION"'/g' R/pkg/DESCRIPTION
+sed -i".tmp4" 's/Version.*$/Version: '"$R_NEXT_VERSION"'/g' R/pkg/DESCRIPTION
+# Write out the R_NEXT_VERSION to PySpark version info we use dev0 instead of SNAPSHOT to be closer
+# to PEP440.
+sed -i".tmp5" 's/__version__ = .*$/__version__ = "'"$R_NEXT_VERSION.dev0"'"/' python/pyspark/version.py
+

 # Update docs with next version
-sed -i".tmp3" 's/SPARK_VERSION:.*$/SPARK_VERSION: '"$NEXT_VERSION"'/g' docs/_config.yml
+sed -i".tmp6" 's/SPARK_VERSION:.*$/SPARK_VERSION: '"$NEXT_VERSION"'/g' docs/_config.yml
 # Use R version for short version
-sed -i".tmp4" 's/SPARK_VERSION_SHORT:.*$/SPARK_VERSION_SHORT: '"$R_NEXT_VERSION"'/g' docs/_config.yml
+sed -i".tmp7" 's/SPARK_VERSION_SHORT:.*$/SPARK_VERSION_SHORT: '"$R_NEXT_VERSION"'/g' docs/_config.yml

 git commit -a -m "Preparing development version $NEXT_VERSION"
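For the development branch the analogous rewrite drops -SNAPSHOT and appends .dev0; a small illustrative Python equivalent (version values here are made up):

```python
# Mirrors the sed above: R_NEXT_VERSION is NEXT_VERSION minus "-SNAPSHOT",
# and python/pyspark/version.py gets "<R_NEXT_VERSION>.dev0".
next_version = "2.1.1-SNAPSHOT"  # illustrative
r_next_version = next_version.replace("-SNAPSHOT", "")
print('__version__ = "%s.dev0"' % r_next_version)  # __version__ = "2.1.1.dev0"
```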

dev/lint-python

Lines changed: 3 additions & 1 deletion

@@ -20,7 +20,9 @@
 SCRIPT_DIR="$( cd "$( dirname "$0" )" && pwd )"
 SPARK_ROOT_DIR="$(dirname "$SCRIPT_DIR")"
 PATHS_TO_CHECK="./python/pyspark/ ./examples/src/main/python/ ./dev/sparktestsupport"
-PATHS_TO_CHECK="$PATHS_TO_CHECK ./dev/run-tests.py ./python/run-tests.py ./dev/run-tests-jenkins.py"
+# TODO: fix pep8 errors with the rest of the Python scripts under dev
+PATHS_TO_CHECK="$PATHS_TO_CHECK ./dev/run-tests.py ./python/*.py ./dev/run-tests-jenkins.py"
+PATHS_TO_CHECK="$PATHS_TO_CHECK ./dev/pip-sanity-check.py"
 PEP8_REPORT_PATH="$SPARK_ROOT_DIR/dev/pep8-report.txt"
 PYLINT_REPORT_PATH="$SPARK_ROOT_DIR/dev/pylint-report.txt"
 PYLINT_INSTALL_INFO="$SPARK_ROOT_DIR/dev/pylint-info.txt"

dev/make-distribution.sh

Lines changed: 15 additions & 1 deletion

@@ -33,14 +33,15 @@ SPARK_HOME="$(cd "`dirname "$0"`/.."; pwd)"
 DISTDIR="$SPARK_HOME/dist"

 MAKE_TGZ=false
+MAKE_PIP=false
 NAME=none
 MVN="$SPARK_HOME/build/mvn"

 function exit_with_usage {
   echo "make-distribution.sh - tool for making binary distributions of Spark"
   echo ""
   echo "usage:"
-  cl_options="[--name] [--tgz] [--mvn <mvn-command>]"
+  cl_options="[--name] [--tgz] [--pip] [--mvn <mvn-command>]"
   echo "make-distribution.sh $cl_options <maven build options>"
   echo "See Spark's \"Building Spark\" doc for correct Maven options."
   echo ""

@@ -67,6 +68,9 @@ while (( "$#" )); do
     --tgz)
       MAKE_TGZ=true
       ;;
+    --pip)
+      MAKE_PIP=true
+      ;;
     --mvn)
       MVN="$2"
       shift

@@ -201,6 +205,16 @@ fi
 # Copy data files
 cp -r "$SPARK_HOME/data" "$DISTDIR"

+# Make pip package
+if [ "$MAKE_PIP" == "true" ]; then
+  echo "Building python distribution package"
+  cd $SPARK_HOME/python
+  python setup.py sdist
+  cd ..
+else
+  echo "Skipping creating pip installable PySpark"
+fi
+
 # Copy other things
 mkdir "$DISTDIR"/conf
 cp "$SPARK_HOME"/conf/*.template "$DISTDIR"/conf

dev/pip-sanity-check.py

Lines changed: 36 additions & 0 deletions

@@ -0,0 +1,36 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+from __future__ import print_function
+
+from pyspark.sql import SparkSession
+import sys
+
+if __name__ == "__main__":
+    spark = SparkSession\
+        .builder\
+        .appName("PipSanityCheck")\
+        .getOrCreate()
+    sc = spark.sparkContext
+    rdd = sc.parallelize(range(100), 10)
+    value = rdd.reduce(lambda x, y: x + y)
+    if (value != 4950):
+        print("Value {0} did not match expected value.".format(value), file=sys.stderr)
+        sys.exit(-1)
+    print("Successfully ran pip sanity check")
+
+    spark.stop()
