groupby first can return values not in group #9300

meloncholy · 2015-01-19T13:33:26Z

Not that familiar (at all :) with pandas internals, but I don't think this is expected behaviour.

from pandas import DataFrame

f3 = DataFrame(
    [
        [95820843523155097, 1, 'director', 1],
        [95820843523155098, 1, 'director', 2],
        [95820843523155099, 1, 'director', 3],
        [95820843523155100, 2, 'director', 4],
        [95820843523155101, 2, 'computer system management (director)', 5],
        [95820843523155102, 3, 'company director', 6],
        [95820843523155103, 3, 'office manager', 7]
    ],
    columns=['uid', 'cid', 'role', 'idx']
)

f3.dtypes

uid      int64
cid      int64
role    object
idx      int64
dtype: object

Observed behaviour

f3.groupby('cid').first()

	uid	role	idx
cid
1	95820843523155104	director	1
2	95820843523155104	director	4
3	95820843523155104	company director	6

The uid column contains values that are all the same and aren't in the original data. (This isn't always true in larger sets; sometimes there's an overlap.)

Expected behaviour

f3.groupby('cid').apply(lambda g: g[:1])

		uid	role	idx
cid
1	0	95820843523155097	director	1
2	3	95820843523155100	director	4
3	5	95820843523155102	company director	6

This is what I expected to happen (i.e. the uid matches the rest of the row).

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.2.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-24-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: 0.21.1
numpy: 1.8.2
scipy: 0.14.0
statsmodels: 0.6.1
IPython: 2.3.1
sphinx: None
patsy: 0.3.0
dateutil: 2.1
pytz: 2014.9
bottleneck: 0.8.0
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.0
openpyxl: 2.0.2
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.6.4
lxml: 3.4.1
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.8
pymysql: None
psycopg2: 2.5.4 (dt dec pq3 ext)


TomAugspurger · 2015-01-19T14:22:22Z

Looks like a precision issue since your uids are larger than your system maxint.

I'm not sure what guarantees pandas makes in this case, but as a workaround you can convert the uids to strings: df['uid'] = df.uid.astype(str)
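A minimal sketch of that workaround applied to the frame from the report, assuming pandas is installed. The names df and first_rows are only illustrative. Converting uid to strings before grouping means the column is never treated as numeric, so the ids come back unchanged.

```python
import pandas as pd

# Same data as in the report above.
df = pd.DataFrame(
    [
        [95820843523155097, 1, 'director', 1],
        [95820843523155098, 1, 'director', 2],
        [95820843523155099, 1, 'director', 3],
        [95820843523155100, 2, 'director', 4],
        [95820843523155101, 2, 'computer system management (director)', 5],
        [95820843523155102, 3, 'company director', 6],
        [95820843523155103, 3, 'office manager', 7],
    ],
    columns=['uid', 'cid', 'role', 'idx'],
)

# Store the ids as strings so no numeric conversion can touch them.
df['uid'] = df['uid'].astype(str)

# First row of each cid group; uid now matches the rest of the row.
first_rows = df.groupby('cid').first()
print(first_rows)
```

The trade-off is that uid is no longer numeric, which is usually fine for an identifier column.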

meloncholy · 2015-01-19T15:56:15Z

You're right that converting to a string worked (thanks!), though sys.maxint is quite a bit bigger. So not sure that's the reason?

9223372036854775807 vs
95820843523155104

jreback · 2015-01-25T23:12:23Z

Actually, this is related to #9345 / #9311: these values are getting cast to floats during the groupby, so a fix there might fix this too. But if you have very large numbers, it's actually better to make them object dtype, which will preserve them entirely.
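To illustrate why a cast to float would produce the value seen in the report, here is a small sketch in plain Python (no pandas). It assumes the cast is to a 64-bit float, which represents integers exactly only up to 2**53. The uids here are larger than that, so a round trip through float snaps each one to the nearest representable value.

```python
# The seven uids from the report: 95820843523155097 .. 95820843523155103.
uids = [95820843523155097 + i for i in range(7)]

# A 64-bit float has a 53-bit significand, so integers above 2**53
# are not all exactly representable.
exact_limit = 2 ** 53

# Round-trip each uid through float, as a float cast would do.
roundtrip = [int(float(u)) for u in uids]

for before, after in zip(uids, roundtrip):
    print(before, '->', after)

# All seven distinct uids collapse to the single value from the bug report.
assert all(u > exact_limit for u in uids)
assert set(roundtrip) == {95820843523155104}
```

At this magnitude adjacent floats are 16 apart, which is why all seven ids land on the same number even though they are well below sys.maxint.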

meloncholy · 2015-01-25T23:26:59Z

OK, good to know. Thanks!

jreback closed this as completed Jan 25, 2015

jreback added the Dtype Conversions and Groupby labels Jan 25, 2015
