Read and write spss data format #5768


Closed
benjello opened this issue Dec 24, 2013 · 19 comments · Fixed by #26537
Labels
Enhancement IO Data IO issues that don't fit into a more specific label
Comments

@benjello
Contributor

It would be nice to be able to import SPSS data with read_spss and export it using to_spss.

@benjello
Contributor Author

@ghost

ghost commented Dec 24, 2013

Does SPSS offer no export formats compatible with pandas? Or is there relevant
data contained in the proprietary file format which can't otherwise be accessed?

This may still make sense if, for example, users without access to SPSS frequently
get SPSS files (a.k.a. the Microsoft Word problem). Not sure if that's the case though.

@benjello
Contributor Author

Actually, I do not use SPSS. But I have some SPSS data files that I want to explore. I could work with R, but the data is so huge that this is not possible. The only way I have found is tedious: I have to use PSPP to prepare a subset and then import it into R, with all the back and forth needed to add some variables and to master the SPSS syntax. Since pandas can deal with huge datasets, I do think it should provide import from SPSS. And I am willing to test it.

@ghost

ghost commented Dec 28, 2013

That was a good enough reason for stata support, yeah.

Note that this particular package weighs in at 90-180MB, includes a large chunk of binary
proprietary (yet free to use) code, and the PyPI version doesn't work on 64-bit Linux (though the bug
was fixed two months ago on Bitbucket), so its maturity is slightly suspect.

Basic usage doesn't require any explicit pandas cooperation:

import pandas as pd
import savReaderWriter as s
# SavReader is iterable: each item is one record (row) of the .sav file
df = pd.DataFrame(list(s.SavReader('foo.sav')))

is all it takes.

There's something to be said for pandas accepting data rather than data formats.
If there's a package that reads format X and produces data, then pandas implicitly
supports format X.

Definitely worth a FAQ entry though, I'm sure other users have this need.
Care to write some prose and make a PR?
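
The "data rather than data formats" point can be sketched without any SPSS tooling. In the snippet below, `fake_sav_rows` and the column names are hypothetical stand-ins for whatever a real reader (such as savReaderWriter) would yield; no actual .sav file is read.

```python
import pandas as pd


def fake_sav_rows():
    """Hypothetical stand-in for a file reader: yields one list per record."""
    yield [1, "Alice", 34.0]
    yield [2, "Bob", 27.0]


# Column names are assumed here; a real reader would take them from the file.
columns = ["id", "name", "age"]

# pandas accepts any iterable of records, whatever format they came from.
df = pd.DataFrame(list(fake_sav_rows()), columns=columns)
print(df.dtypes)
```

Any package that can produce rows like these gives pandas everything it needs, which is the argument made in the comment above.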

@benjello
Contributor Author

Thank you very much for looking through this problem.
I tried to import some .sav data into a pandas DataFrame as you did, but I ended up with the following error on Win64:
WindowsError: [Error 193] %1 is not a valid Win32 application in Python

@benjello
Contributor Author

Sorry @y-p, this seems to be a known, reported error. I am using a 64-bit Python on a 64-bit machine, which is the problematic case according to this discussion: https://bitbucket.org/fomcl/savreaderwriter/issue/12/win64-error

@kmfolgar

Hi All,
any progress on this topic?
I searched a lot about this and didn't find any answer, so I use this small piece of code to import .sav files into pandas, but it only works in Python 3.5

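import pandas as pd
import numpy as np
import savReaderWriter as spss  # the module is named savReaderWriter

# Read every record of the .sav file, then hand the records to pandas
with spss.SavReaderNp("some_sav_file.sav") as reader:
    records = reader.all()
df = pd.DataFrame(records)
df.head()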

Hope this works for someone.

Have a great day!

@ozak

ozak commented Feb 25, 2018

Lots of data is made available in SPSS so this tool would be very useful, especially for social scientists and economists. If the solution of @ThinkOnData works, it seems it should be an easy improvement. I will try it out and may submit a PR.

@jukkahuhtamaki

jukkahuhtamaki commented Jun 3, 2018

I second the usefulness of read_spss and to_spss. I am currently entering a collaboration with a team of Information Systems scholars that use SPSS and UI-based Structural Equation Modeling tools.

Using SPSS to manage the master survey data seems to be a common approach (cf., Gaskin, 2016).

As the first step in introducing a computational approach to the collaboration, I am writing a script that preprocesses the survey data we have collected. Being able to easily read and write SPSS would certainly be helpful here.

@jorisvandenbossche
Member

For those interested in this issue, I think contributing (or suggesting) better pandas support to https://bitbucket.org/fomcl/savreaderwriter would be a good first step.
If the package had direct support for reading a file into a DataFrame, and advertised this, I think it would already help a lot of people without needing to add it directly to pandas itself.

@alexwbakker

alexwbakker commented Jul 21, 2018

SPSS is the only file format that is exportable from common survey tools like Qualtrics, SurveyGizmo, and SurveyMonkey that allows you to preserve both the values and the labels for many variables.

Survey data seems to always be represented as one row per response and one column per question or question choice. For single-select or open-ended questions, the values are usually coded as a single number per choice, so a column may have a 1, 2, or 3 for male, female, or 'prefer not to state'. If you can multi-select, most survey tools export questions like Q2_1, Q2_2, Q2_3... for each of the possible choices, and then each cell has a 1 for Selected and a null/SYSMIS/NaN value if it was not selected. Sometimes, that missing data is also coded as -99 or other values.

Finally, SPSS has 2 properties, VARIABLE LABELS and VALUE LABELS that contain the Strings that tend to correspond to question text/choice text.

If you export a survey file as CSV from any of those tools, you are presented with a choice: either take the strings from each question (e.g. "Strongly Agree" will be what is in each cell where that was selected), or have a '5' instead. The trouble is that for many types of analysis you want both. In pandas, I think it would be easy to treat all of these as category labels with pd.Categorical.

The common use case for this is to see what the average scale rating is for a question, e.g. getting the mean response and standard deviation on a Disagree <---> Agree scale. But you may also want to produce a cross tab that shows the count/percent for each category column. SPSS can do this pretty well, but it is expensive, slow, and has a really lame internal syntax language.

If pandas had full support for SPSS files, and could write them out, it would be very helpful for initial data exploration, question aggregation, data restructuring, and text analysis, and then for writing out the new file to leverage other downstream tools like Wincross for reporting/analysis that require SPSS files and are easy for business/non-technical people to use.

As someone that deals with a lot of survey data, I'm happy to talk/chat/answer any questions I can about this, and to test anything out in SPSS if it would help anyone. Note that I'm a novice when it comes to python/pandas, but I've been using SPSS for a long time and am looking to move away from it completely.
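
The "you want both the code and the label" problem described in this comment can be sketched in plain pandas. Everything in the snippet is made up for illustration: the column names, the responses, and the two dicts standing in for SPSS VALUE LABELS. `to_labelled` is a hypothetical helper, not a pandas function.

```python
import pandas as pd

# Hypothetical survey export: one row per response, numeric codes per answer.
df = pd.DataFrame({
    "gender": [1, 2, 2, 3, 1],
    "q1_agree": [5, 4, 2, 5, 3],
})

# Assumed VALUE LABELS, written as plain dicts (code -> label).
gender_labels = {1: "Male", 2: "Female", 3: "Prefer not to state"}
agree_labels = {1: "Strongly Disagree", 2: "Disagree", 3: "Neutral",
                4: "Agree", 5: "Strongly Agree"}


def to_labelled(codes, labels):
    """Return an ordered categorical holding the label for each code."""
    return pd.Categorical(codes.map(labels),
                          categories=list(labels.values()), ordered=True)


df["gender_label"] = to_labelled(df["gender"], gender_labels)
df["q1_label"] = to_labelled(df["q1_agree"], agree_labels)

# The numeric codes give the scale statistics...
mean_rating = df["q1_agree"].mean()
std_rating = df["q1_agree"].std()

# ...while the labels give a readable cross tab of counts.
counts = pd.crosstab(df["gender_label"], df["q1_label"])
print(mean_rating, std_rating)
print(counts)
```

Keeping the codes and the categorical labels side by side is one possible design; a reader that attaches the labels directly would make the manual mapping step unnecessary.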

@ofajardo

I have written a wrapper for the C library ReadStat, named pyreadstat, which reads SPSS sav, zsav and por files: github.com/Roche/pyreadstat

@cbrnr
Contributor

cbrnr commented May 22, 2019

It would be great if this functionality were available directly from pandas, e.g. via read_spss. @TomAugspurger @jreback @jorisvandenbossche (sorry for the explicit mentions, I don't know how to mention the whole dev team) would this be an option (given that this requires a C lib)? @ofajardo would you be willing to merge your code?

@TomAugspurger
Contributor

TomAugspurger commented May 22, 2019 via email

@ofajardo

ofajardo commented May 22, 2019

@cbrnr @TomAugspurger I am willing to help and contribute

@cbrnr
Contributor

cbrnr commented May 23, 2019

Great! So this could be as simple as adding pandas/io/spss.py and wrapping the relevant functions of pyreadstat (and making sure to import it only within functions and not globally). Since only a DataFrame should be returned, some effort will probably be necessary to use the meta information for creating suitable column names, data types, and so on.

Also, pyreadstat can read SAS and Stata files, which pandas already supports natively (but pyreadstat is much faster). I don't know how you would like to handle these file formats (ignore for now, create separate modules/functions or integrate with existing readers), but I think for now just adding support for SPSS files would be a good plan.
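
A minimal sketch of the wrapper idea from this comment, assuming pyreadstat is an optional dependency. The function name `read_spss_sketch` and its parameter names are illustrative choices, not a confirmed pandas signature; `pyreadstat.read_sav` and its `usecols` and `apply_value_formats` arguments are the library's own.

```python
from pathlib import Path


def read_spss_sketch(path, usecols=None, convert_categoricals=True):
    """Read an SPSS .sav file into a DataFrame via pyreadstat (illustrative)."""
    # Import inside the function so the optional dependency is only
    # required when this reader is actually called.
    try:
        import pyreadstat
    except ImportError as err:
        raise ImportError(
            "pyreadstat is required to read SPSS files; install it first."
        ) from err

    # read_sav returns (DataFrame, metadata); only the frame is returned here.
    df, _meta = pyreadstat.read_sav(
        str(Path(path)),
        usecols=usecols,
        apply_value_formats=convert_categoricals,
    )
    return df
```

Calling `read_spss_sketch("survey.sav")` (a hypothetical file name) would return only the DataFrame. The metadata object is discarded in this sketch, so variable labels are lost unless the wrapper is extended to use them.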

@TomAugspurger
Contributor

Also, pyreadstat can also read SAS and Stata files, which Pandas already supports natively (but pyreadstat is much faster). I don't know how you would like to handle these file formats (ignore for now,

I think ignore for now, but we can certainly revisit once we have spss taken care of. Once there's interest we can add an engine keyword to read_sas and read_stata.

@allefeld

This issue shouldn't have been closed, since #26537 covers only the reading part.

@jreback
Contributor

jreback commented May 17, 2022

There is no write support anywhere, AFAIK.
