Add reader for SPSS (.sav) files #26537
Please add this to the CI in several (but not all) places; the tests should skip gracefully if it is not installed. You can commit the .sav files to pandas/tests/io/data/ (see how we do this for .dta). Also update install.rst.
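One way to get the "skip gracefully if not installed" behavior requested here is to guard the tests on whether pyreadstat can be imported. The sketch below uses the standard library's unittest so that it is self-contained; it is an illustration under that assumption, not the test code from this PR (in a pytest-based suite, pytest.importorskip("pyreadstat") serves the same purpose). The class and test names are hypothetical.

```python
import unittest

# Try the optional dependency once at import time and remember the result.
try:
    import pyreadstat
    HAS_PYREADSTAT = True
except ImportError:
    HAS_PYREADSTAT = False


@unittest.skipUnless(HAS_PYREADSTAT, "pyreadstat is not installed")
class ReadSpssSmokeTest(unittest.TestCase):
    """Hypothetical test case; it is skipped, not failed, without pyreadstat."""

    def test_reader_is_available(self):
        # Only runs when the import above succeeded.
        self.assertTrue(callable(pyreadstat.read_sav))
```

On a CI build without pyreadstat the test is reported as skipped, so the build stays green; on builds that install it, the test runs normally.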
Codecov Report
@@ Coverage Diff @@
## master #26537 +/- ##
===========================================
- Coverage 91.76% 41.7% -50.07%
===========================================
Files 174 175 +1
Lines 50629 50637 +8
===========================================
- Hits 46462 21119 -25343
- Misses 4167 29518 +25351
Continue to review full report at Codecov.
Codecov Report
@@ Coverage Diff @@
## master #26537 +/- ##
==========================================
- Coverage 91.88% 91.86% -0.03%
==========================================
Files 179 180 +1
Lines 50696 50710 +14
==========================================
+ Hits 46581 46583 +2
- Misses 4115 4127 +12
Continue to review full report at Codecov.
I've added some test files from the haven package - how do we attribute the original authors? They don't have a standard license, so maybe we have to ask them for permission to use their test files? One test file containing dates cannot be loaded because I've updated
Seems like haven has an MIT license, according to https://cran.r-project.org/web/packages/haven/index.html. Does that sound right @hadley? If so, I think you can include Haven's license file in our licenses folder, and we'll be good. It'd be good to note the source of these test files in
Yeah, that's fine with me. (I generally follow the US and consider data to be un-copyrightable, although in this case I guess it's the specific form that's important, not so much the data)
All tests pass now
put licenses in pandas/LICENSES
ci/deps/travis-37.yaml
Outdated
@@ -22,3 +22,4 @@ dependencies:
- pip
- pip:
- moto
- pyreadstat
is this only a wheel?
Add to one of the Windows builds and the macOS build; is this supported on 3.5? Any other requirements?
Done.
I don't know what you mean by "only a wheel" though. pyreadstat has binary wheels for Windows, Linux, and macOS for Python 3.5, 3.6, and 3.7: https://pypi.org/project/pyreadstat/#files
Meaning you are installing through pip; rather use a conda package if it's available.
Seems to be available through conda-forge: https://github.com/conda-forge/pyreadstat-feedstock
pandas/io/spss.py
Outdated
@@ -0,0 +1,27 @@
def read_spss(path, usecols=None, categorical=True):
can you add typing
Sure. This is the first time I've used typing so please double-check.
@jreback I've implemented all changes.
pandas/io/spss.py
Outdated
Parameters
----------
path : string
does this accept a pathlike? Union[str, Path] ?
pathlib.Path should work everywhere a path string is expected, so IMO it is not really necessary to explicitly add this. But if you want I can of course add it (this requires an extra import pathlib though).
I've implemented this change.
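For readers following the path-like discussion, here is a minimal sketch of what a Union[str, Path] annotation looks like in practice. The helper name as_path_string is hypothetical and not part of this PR; it only illustrates that both a plain string and a pathlib.Path can be accepted and turned into a string before being handed to a reader that expects one.

```python
import os
from pathlib import Path
from typing import Union


def as_path_string(path: Union[str, Path]) -> str:
    """Hypothetical helper: return a plain string for a str or pathlib.Path."""
    # os.fspath returns a str unchanged and calls __fspath__ on a Path,
    # so both accepted input types end up as the same kind of object.
    return os.fspath(path)


# Usage: both spellings refer to the same file.
from_str = as_path_string(os.path.join("data", "example.sav"))
from_path = as_path_string(Path("data") / "example.sav")
```

The annotation documents the accepted types for type checkers; the conversion itself is what makes the function tolerant of either input at runtime.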
Yes,
Fixed the formatting issues, hopefully this will come back green and then it should be ready to merge.
Minor nit on annotation but otherwise this lgtm - nice change!
@WillAyd could you please elaborate?
Not sure what happened to my comment but was asking to change
@WillAyd done!
@cbrnr Pyreadstat is available through both pip and conda. This is explained in the "How to install" section of the README.
I didn't see that you have conda-forge added, sorry about that. I've addressed all comments.
No idea what's going on with Azure, could someone please restart?
Can you merge master? (You have a conflict.)
OK, I've rebased.
Looks good. @jorisvandenbossche @TomAugspurger @bashtage ok with this?
Finally all green!
Thanks @cbrnr
@cbrnr I am in the process of releasing a new version of pyreadstat with writing capabilities, in case that's of interest to you. It should be available later today or tomorrow.
Nice! However, at the moment I don't think there's a need for exporting to a proprietary format directly from pandas. If someone really wants to do that they can use your package directly.
Still missing documentation http://pandas-docs.github.io/pandas-docs-travis/user_guide/io.html
Can you open a PR fixing that? Or a new issue if you don't plan to?
…On Fri, Jul 19, 2019 at 8:16 AM Ignacio Santolin ***@***.***> wrote:
Still missing documentation
http://pandas-docs.github.io/pandas-docs-travis/user_guide/io.html
git diff upstream/master -u -- "*.py" | flake8 --diff
I haven't added a test yet because I wanted to ask which test .sav file I should use (and where to put it). Also, there's no whatsnew entry yet - in which section should this entry go (and I assume it will be out for 0.25.0, so I'll have to change 0.24.3 to 0.25.0 in the docstring).
This PR adds the capability to load SPSS .sav files with df = pd.io.read_spss("spss_file.sav"). Currently, there are two optional arguments: usecols (should be self-explanatory, let me know if you don't want me to handle a simple str) and categorical, which maps to the apply_value_formats parameter in read_sav. With categorical=True, a categorical column is created with the labels from the .sav file. If False, numbers will be used.

A few open questions:
- dates_as_pandas_datetime, encoding, and user_missing, which I haven't mapped yet.
- read_spss or read_sav? SPSS files have the extension sav, but the R haven package has a function read_spss (which is why I'd prefer read_spss).
- pyreadstat.read_sav returns a dataframe and meta-information separately, which I think we shouldn't do in pandas.
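The mapping described above can be sketched roughly as follows. This is an illustration under assumptions, not the code merged in this PR: the names read_spss_sketch and normalize_usecols are hypothetical, and the only external calls relied on are pyreadstat.read_sav with its usecols and apply_value_formats parameters and its (dataframe, metadata) return value.

```python
from typing import List, Optional, Sequence, Union


def normalize_usecols(
    usecols: Union[None, str, Sequence[str]]
) -> Optional[List[str]]:
    """Hypothetical helper: accept None, a single name, or several names."""
    if usecols is None:
        return None
    if isinstance(usecols, str):
        # A bare string is treated as one column name, as the description allows.
        return [usecols]
    return list(usecols)


def read_spss_sketch(path, usecols=None, categorical=True):
    """Hypothetical wrapper around pyreadstat.read_sav that returns only the data."""
    # Imported lazily so that this module can be imported without pyreadstat.
    import pyreadstat

    df, _metadata = pyreadstat.read_sav(
        str(path),
        usecols=normalize_usecols(usecols),
        apply_value_formats=categorical,  # True: labels become categories
    )
    # The metadata object is dropped so callers get a single DataFrame.
    return df
```

Returning only the DataFrame keeps the function consistent with other pandas readers; anyone who needs the metadata can still call pyreadstat.read_sav directly.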