Read and write spss data format #5768

benjello · 2013-12-24T00:13:06Z

It would be nice to be able to import SPSS data with read_spss and export it using to_spss.

benjello · 2013-12-24T00:31:25Z

The following package might be useful https://www.ibm.com/developerworks/community/files/app?lang=en#/person/270003ERMT/file/b77a0da0-2f47-454b-b505-5404b242d78c

ghost · 2013-12-24T00:32:26Z

Does SPSS offer no export formats compatible with pandas? Or is there relevant
data contained in the proprietary file format which can't otherwise be accessed?

This may still make sense if, for example, users without access to SPSS frequently
get SPSS files (a.k.a. the Microsoft Word problem). Not sure if that's the case though.

benjello · 2013-12-28T17:41:53Z

Actually, I do not use SPSS, but I have some SPSS data files that I want to explore. I could work in R, but the data is so large that this is not possible. The only way I have found is tedious: I have to use PSPP to prepare a subset and then import it into R, with all the back and forth needed to add variables and to master the SPSS syntax. Since pandas can deal with huge datasets, I do think it should provide import from SPSS. And I am willing to test it.

ghost · 2013-12-28T18:43:11Z

That was a good enough reason for stata support, yeah.

Note that this particular package weighs in at 90-180MB, includes a large chunk of binary
proprietary (yet free to use) code, and the PyPI version doesn't work on 64-bit Linux (though the bug
was fixed two months ago on Bitbucket), so its maturity is slightly suspect.

Basic usage doesn't require any explicit pandas cooperation:

import pandas as pd
import savReaderWriter as s
df = pd.DataFrame(list(s.SavReader('foo.sav')))

is all it takes.

There's something to be said for pandas accepting data rather than data formats.
If there's a package that reads format X and produces data, then pandas implicitly
supports format X.
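A minimal sketch of that idea, assuming only that some third-party reader yields one sequence per record. The helper name `frame_from_rows` and the sample rows are hypothetical stand-ins, not part of pandas or savReaderWriter.

```python
import pandas as pd


def frame_from_rows(rows, columns=None):
    """Build a DataFrame from any iterable of row sequences (hypothetical helper)."""
    # list() drains generators and reader objects so pandas gets concrete rows.
    return pd.DataFrame(list(rows), columns=columns)


# Stand-in for a format-specific reader that yields one list per record.
sample_rows = iter([[1, "a"], [2, "b"]])
df = frame_from_rows(sample_rows, columns=["id", "code"])
```

Column names have to come from the reader (or be supplied by hand), since bare rows carry no header.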

Definitely worth a FAQ entry though, I'm sure other users have this need.
Care to write some prose and make a PR?

benjello · 2013-12-29T17:48:39Z

Thank you very much for looking through this problem.
I tried to import some .sav data into a pandas DataFrame as you did, but I ended up with the following error on win64:
WindowsError: [Error 193] %1 is not a valid Win32 application in Python

benjello · 2013-12-29T18:01:49Z

Sorry @y-p, this seems to be a known, reported error. I am using 64-bit Python on a 64-bit machine, which is the problematic case according to this discussion: https://bitbucket.org/fomcl/savreaderwriter/issue/12/win64-error

kmfolgar · 2017-09-22T19:10:43Z

Hi all,
any progress on this topic?
I searched a lot for this and didn't find an answer, so I use this small snippet to import .sav files into pandas, but it only works in Python 3.5.

import pandas as pd
import numpy as np
import savReaderWriter as spss  # the package is named savReaderWriter

# SavReaderNp returns the records as a NumPy array
with spss.SavReaderNp("some_sav_file.sav") as reader:
    records = reader.all()
df = pd.DataFrame(records)
df.head()

Hope this works for someone.

Have a great day!

ozak · 2018-02-25T03:16:45Z

Lots of data is made available in SPSS format, so this tool would be very useful, especially for social scientists and economists. If the solution of @ThinkOnData works, it seems it should be an easy improvement. I will try it out and may submit a PR.

jukkahuhtamaki · 2018-06-03T17:04:41Z

I second the usefulness of read_spss and to_spss. I am currently entering a collaboration with a team of Information Systems scholars that use SPSS and UI-based Structural Equation Modeling tools.

Using SPSS to manage the master survey data seems to be a common approach (cf., Gaskin, 2016).

As the first step in introducing a computational approach to the collaboration, I am writing a script that preprocesses the survey data we have collected. Being able to easily read and write SPSS would certainly be helpful here.

jorisvandenbossche · 2018-06-04T11:51:08Z

For those interested in this issue, I think contributing better pandas support (or suggesting it) to savReaderWriter (https://bitbucket.org/fomcl/savreaderwriter) would be a good first step.
If the package had direct support for reading a file into a DataFrame, and advertised this, I think it would already help a lot of people without needing to add it directly to pandas itself.

alexwbakker · 2018-07-21T17:33:16Z

SPSS is the only file format that is exportable from common survey tools like Qualtrics, SurveyGizmo, and SurveyMonkey that allows you to preserve both the values and the labels for many variables.

Survey data seems to always be represented as one row per response and one column per question or question choice. For single-select or open-ended questions, the values are usually coded as a single number per choice, so a column may have a 1, 2, or 3 for male, female, or 'prefer not to state'. If you can multi-select, most survey tools export questions like Q2_1, Q2_2, Q2_3... for each of the possible choices, and then each cell has a 1 for Selected and a null/SYSMIS/NaN value if it was not selected. Sometimes that missing data is also coded as -99 or another value.
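To make that layout concrete, here is a small sketch with made-up data. The column names follow the Q2_1, Q2_2, Q2_3 pattern described above, and -99 as the missing code is only an assumption for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical multi-select export: 1 = selected, NaN = not selected,
# and -99 used as an explicit "missing" code for the last respondent.
raw = pd.DataFrame(
    {
        "Q2_1": [1, np.nan, 1, -99],
        "Q2_2": [np.nan, 1, 1, -99],
        "Q2_3": [np.nan, np.nan, 1, -99],
    }
)

# Turn the sentinel code into a real missing value.
clean = raw.replace(-99, np.nan)

# Boolean "was this choice selected" table, then a count per choice.
selected = clean.eq(1)
counts = selected.sum()
```

With this coding, "not selected" and "not answered" both end up as NaN, which is one reason the original sentinel codes are worth keeping around.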

Finally, SPSS has two properties, VARIABLE LABELS and VALUE LABELS, that contain the strings that tend to correspond to the question text and choice text.

If you export a survey file as CSV from any of those tools, you are presented with a choice: either take the strings from each question (e.g. "Strongly Agree" will be what is in each cell where that was selected), or have a '5'. The trouble is that for many types of analysis you want both. In pandas, I think it would be easy to treat all of these as category labels with pd.Categorical.
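A rough sketch of how both views could live in one column, assuming a made-up five-point item. The `value_labels` mapping here is hypothetical and stands in for what an SPSS VALUE LABELS entry would provide.

```python
import pandas as pd

# Hypothetical value labels for one Likert item (code -> label).
value_labels = {
    1: "Strongly Disagree",
    2: "Disagree",
    3: "Neutral",
    4: "Agree",
    5: "Strongly Agree",
}
codes = pd.Series([5, 4, 4, 2, 5], name="Q1")

# An ordered categorical keeps the labels and the scale order together.
dtype = pd.CategoricalDtype(categories=list(value_labels.values()), ordered=True)
labeled = codes.map(value_labels).astype(dtype)

# The numeric codes can be recovered from the category positions
# (positions start at 0, the scale here starts at 1).
recovered = labeled.cat.codes + 1
```

This only round-trips cleanly when the codes are consecutive integers; for arbitrary codes, keeping the original mapping alongside the column is safer.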

The common use case is to see the average rating for a scale question, e.g. the mean response and standard deviation on a Disagree <---> Agree scale. But you may also want to produce a cross tab that shows the count/percent for each category column. SPSS can do this pretty well, but it is expensive, slow, and has a really lame internal syntax language.
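For illustration, both summaries can be sketched in pandas on a tiny made-up dataset. The column names `gender` and `Q1` are assumptions for this example only.

```python
import pandas as pd

# Hypothetical responses: one demographic column and one 1-5 scale item.
survey = pd.DataFrame(
    {
        "gender": ["male", "female", "female", "male", "female"],
        "Q1": [5, 4, 4, 2, 5],
    }
)

# Scale summary: mean and sample standard deviation of the numeric codes.
mean_rating = survey["Q1"].mean()
std_rating = survey["Q1"].std()

# Cross tab of counts, and the same table as row percentages.
counts = pd.crosstab(survey["gender"], survey["Q1"])
row_pct = pd.crosstab(survey["gender"], survey["Q1"], normalize="index")
```

The mean needs the numeric codes while the cross tab reads better with labels, which is the "you want both" problem in practice.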

If pandas had full support for SPSS files and could write them out, it would be very helpful for initial data exploration, question aggregation, data restructuring, and text analysis, and then for writing out the new file to leverage downstream tools like Wincross for reporting/analysis, which require SPSS files and are easy for business/non-technical people to use.

As someone that deals with a lot of survey data, I'm happy to talk/chat/answer any questions I can about this, and to test anything out in SPSS if it would help anyone. Note that I'm a novice when it comes to python/pandas, but I've been using SPSS for a long time and am looking to move away from it completely.

ofajardo · 2018-08-26T16:22:58Z

I have written a wrapper for the C library ReadStat, named pyreadstat, which reads SPSS sav, zsav and por files: github.com/Roche/pyreadstat
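A short sketch of how this could be used to get the data together with the SPSS labels. It assumes pyreadstat is installed; `load_sav_with_labels` and its `reader` hook are hypothetical names added here for illustration, while `read_sav` and the two metadata attributes come from pyreadstat.

```python
import pandas as pd


def load_sav_with_labels(path, reader=None):
    """Return (DataFrame, variable labels, value labels) for a .sav file.

    `reader` is a hypothetical hook for any callable returning (df, meta);
    by default pyreadstat.read_sav is used.
    """
    if reader is None:
        import pyreadstat  # imported lazily so this module loads without it

        reader = pyreadstat.read_sav
    df, meta = reader(path)
    # column_names_to_labels: column name -> VARIABLE LABEL text
    # variable_value_labels: column name -> {code: VALUE LABEL text}
    return df, dict(meta.column_names_to_labels), dict(meta.variable_value_labels)
```

Typical use would be `df, var_labels, val_labels = load_sav_with_labels("survey.sav")`, where the file name is a placeholder.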

cbrnr · 2019-05-22T09:20:16Z

It would be great if this functionality was available directly from pandas, e.g. via read_spss. @TomAugspurger @jreback @jorisvandenbossche (sorry for the explicit mentions, I don't know how to @-mention the whole dev team) would this be an option (given that this requires a C lib)? @ofajardo would you be willing to merge your code?

TomAugspurger · 2019-05-22T11:38:37Z

It’s more likely that we would have an optional dependency on that package, which read_spss would use. Similar to what we do with pyarrow and parquet.
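As a generic illustration of the optional-dependency pattern (not pandas' actual internal helper), the import can be deferred and wrapped so a missing package produces a clear message. The name `import_optional` is hypothetical.

```python
import importlib


def import_optional(name, purpose):
    """Import `name` on demand, with a friendlier error if it is missing."""
    try:
        return importlib.import_module(name)
    except ImportError as err:
        raise ImportError(
            f"{name} is required for {purpose}; install it with pip or conda."
        ) from err
```

A reader function would call something like `import_optional("pyreadstat", "reading SPSS files")` at call time, so users who never touch SPSS files never need the package.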

ofajardo · 2019-05-22T17:55:36Z

@cbrnr @TomAugspurger I am willing to help and contribute

cbrnr · 2019-05-23T06:22:47Z

Great! So this could be as simple as adding pandas/io/spss.py and wrapping the relevant functions of pyreadstat (and making sure to import it only within functions and not globally). Since only a data frame should be returned, some effort will probably be necessary to use the meta information for creating suitable column names, data types, and so on.
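One way such a wrapper could look, as a sketch only and not taken from any merged code. The function name `read_spss_sketch` and its parameters are assumptions; `pyreadstat.read_sav` with `usecols` and `apply_value_formats` is the underlying call.

```python
def read_spss_sketch(path, usecols=None, convert_categoricals=True):
    """Read a .sav file into a DataFrame via pyreadstat (illustrative sketch)."""
    if usecols is not None:
        # Reject a bare string, which would otherwise be split into characters.
        if isinstance(usecols, (str, bytes)) or not hasattr(usecols, "__iter__"):
            raise TypeError("usecols must be list-like.")
        usecols = list(usecols)

    # Imported inside the function so importing this module never requires
    # pyreadstat; only calling the reader does.
    import pyreadstat

    df, _meta = pyreadstat.read_sav(
        str(path),
        usecols=usecols,
        # When True, value labels replace the numeric codes in the result.
        apply_value_formats=convert_categoricals,
    )
    return df
```

Discarding the metadata keeps the return type a plain DataFrame, at the cost of losing the variable labels.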

Also, pyreadstat can read SAS and Stata files, which pandas already supports natively (but pyreadstat is much faster). I don't know how you would like to handle these file formats (ignore for now, create separate modules/functions, or integrate with existing readers), but I think for now just adding support for SPSS files would be a good plan.

TomAugspurger · 2019-05-23T16:40:54Z

> Also,pyreadstat can also read SAS and Stata files, which Pandas already supports natively (but pyreadstat is much faster). I don't know how you would like to handle these file formats (ignore for now,

I think ignore for now, but we can certainly revisit once we have spss taken care of. Once there's interest we can add an engine keyword to read_sas and read_stata.
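Purely as a sketch of what an engine keyword might look like: the wrapper below is hypothetical and is not an existing pandas signature. It relies only on `pd.read_stata` and `pyreadstat.read_dta`, both of which exist.

```python
import pandas as pd


def read_stata_sketch(path, engine="pandas"):
    """Dispatch Stata reading to one of two backends (hypothetical wrapper)."""
    if engine == "pandas":
        return pd.read_stata(path)
    if engine == "pyreadstat":
        import pyreadstat  # optional dependency, imported only when requested

        df, _meta = pyreadstat.read_dta(str(path))
        return df
    raise ValueError(f"Unknown engine {engine!r}; expected 'pandas' or 'pyreadstat'.")
```

Defaulting to the built-in reader would keep existing behavior unchanged for users who do not pass the keyword.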

allefeld · 2022-05-17T21:46:41Z

This issue shouldn't have been closed, since #26537 covers only the reading part.

jreback · 2022-05-17T21:56:44Z

There is no write support anywhere, AFAIK.

cbrnr mentioned this issue May 27, 2019

Add reader for SPSS (.sav) files #26537

Merged

4 tasks

jreback modified the milestones: Someday, 0.25.0 Jun 3, 2019

jreback closed this as completed in #26537 Jun 16, 2019

ofajardo mentioned this issue Dec 9, 2020

Potential bug in reading SAS files with CHAR (RLE) compression and many repeated characters #31243

Closed
