groupby first can return values not in group #9300

Closed
meloncholy opened this issue Jan 19, 2015 · 4 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions Groupby

Comments

@meloncholy

Not that familiar (at all :) with pandas internals, but I don't think this is expected behaviour.

from pandas import DataFrame

f3 = DataFrame(
    [
        [95820843523155097, 1, 'director', 1],
        [95820843523155098, 1, 'director', 2],
        [95820843523155099, 1, 'director', 3],
        [95820843523155100, 2, 'director', 4],
        [95820843523155101, 2, 'computer system management (director)', 5],
        [95820843523155102, 3, 'company director', 6],
        [95820843523155103, 3, 'office manager', 7]
    ],
    columns=['uid', 'cid', 'role', 'idx']
)

f3.dtypes
uid      int64
cid      int64
role    object
idx      int64
dtype: object

Observed behaviour

f3.groupby('cid').first()
                   uid              role  idx
cid
1    95820843523155104          director    1
2    95820843523155104          director    4
3    95820843523155104  company director    6

The uid column contains values that are all the same and aren't in the original data. (This isn't always true in larger sets; sometimes there's an overlap.)

Expected behaviour

f3.groupby('cid').apply(lambda g: g[:1])
                     uid              role  idx
cid
1   0  95820843523155097          director    1
2   3  95820843523155100          director    4
3   5  95820843523155102  company director    6

This is what I expected to happen (i.e. the uid matches the rest of the row).

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-24-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: 0.21.1
numpy: 1.8.2
scipy: 0.14.0
statsmodels: 0.6.1
IPython: 2.3.1
sphinx: None
patsy: 0.3.0
dateutil: 2.1
pytz: 2014.9
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.0
openpyxl: 2.0.2
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.6.4
lxml: 3.4.1
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.8
pymysql: None
psycopg2: 2.5.4 (dt dec pq3 ext)
@TomAugspurger
Contributor

Looks like a precision issue since your uids are larger than your system maxint.

I'm not sure what guarantees pandas makes in this case, but as a workaround you can convert the uids to strings: df['uid'] = df.uid.astype(str)
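
For anyone trying the workaround, here is a minimal sketch using only the uid and cid columns from the report. It assumes a reasonably recent pandas; the names `first` and `first_uids` are illustrative and not from the thread.

```python
import pandas as pd

# The uid and cid columns from the original report.
df = pd.DataFrame(
    {
        "uid": [95820843523155097 + i for i in range(7)],
        "cid": [1, 1, 1, 2, 2, 3, 3],
    }
)

# Workaround: store the ids as strings so groupby never has to
# treat them as numbers.
df["uid"] = df["uid"].astype(str)

first = df.groupby("cid").first()

# Every uid returned by first() should be one of the original ids.
assert set(first["uid"]) <= set(df["uid"])

# Convert back to Python ints if arithmetic is needed afterwards.
first_uids = [int(u) for u in first["uid"]]
print(first_uids)
```

The ids stay strings for the whole groupby, so they only become integers again in the final, explicit conversion.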

@meloncholy
Author

You're right that converting to a string worked (thanks!), though sys.maxint is quite a bit bigger. So not sure that's the reason?

9223372036854775807 vs
95820843523155104

@jreback
Contributor

jreback commented Jan 25, 2015

Actually this is related to #9345 / #9311. These are getting cast to floats during the groupby, so the fix for those might fix this too. But if you have very large numbers, it's actually better to make them object dtype, which will preserve them entirely.
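
The float cast can be illustrated without pandas. The sketch below is not the groupby code path itself. It assumes only that the ids pass through a 64-bit float at some point, and uses Python's built-in `float` (an IEEE 754 double, the same format as float64). The names `EXACT_LIMIT` and `round_tripped` are illustrative.

```python
# The seven uids from the original report.
uids = [95820843523155097 + i for i in range(7)]

# Up to 2**53, every integer has an exact 64-bit float representation.
# Above it, neighbouring floats are more than 1 apart.
EXACT_LIMIT = 2 ** 53
INT64_MAX = 2 ** 63 - 1

# The ids fit comfortably in int64 but exceed the exact-float range.
print(all(EXACT_LIMIT < u <= INT64_MAX for u in uids))

# Round-trip each id through a float, as a cast to float64 and back would.
round_tripped = [int(float(u)) for u in uids]
print(sorted(set(round_tripped)))
```

At this magnitude adjacent floats are 16 apart, so all seven ids land on the same value, 95820843523155104. That is the uid shown in the observed output, and it explains why the ids can be well below the int64 maximum and still be altered.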

@jreback jreback closed this as completed Jan 25, 2015
@jreback jreback added Dtype Conversions Unexpected or buggy dtype conversions Groupby labels Jan 25, 2015
@meloncholy
Author

OK, good to know. Thanks!
