Velocity HOURLY #99
Conversation
Codecov Report

```
@@           Coverage Diff           @@
##           master      #99   +/-   ##
=======================================
  Coverage   54.78%   54.78%
=======================================
  Files           9        9
  Lines        1150     1150
  Branches      167      167
=======================================
  Hits          630      630
  Misses        493      493
  Partials       27       27
```

Continue to review full report at Codecov.
…to_middle as a variable
The code does not shift the TIME according to seconds_to_middle. However, the portions of the code that do this are commented out, in case we want to go back to this approach.
I haven't quite got to the end of the code yet, but I think I've managed to get my head around the guts of it. There are a few things to think about.
```python
    :param is_WCUR: flag indicating if WCUR is present
    :return: end index of the slice
    """
    nc_cell = nc_cell.where(nc_cell.DEPTH_quality_control < 4, drop=True)
```
Why limit to < 4 and not < 3? With the non-velocity product we included only flags 1 & 2, so let's be consistent here.
Also, this limit would ideally be specified as an argument to the function, or at least a constant within this module.
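For illustration, a minimal sketch of that suggestion (the constant name `MAX_QC_FLAG` and the helper function are hypothetical, not code from this PR):

```python
# Hypothetical sketch: make the QC limit a module constant, overridable per call.
MAX_QC_FLAG = 2  # include only flags 1 (good) and 2 (probably good)

def drop_bad_depth(nc_cell, max_qc_flag=MAX_QC_FLAG):
    """Drop samples whose DEPTH_quality_control flag exceeds max_qc_flag."""
    return nc_cell.where(nc_cell.DEPTH_quality_control <= max_qc_flag, drop=True)
```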
More importantly, we also need to mask out bad values of the velocity components! E.g. replace `UCUR` values where `UCUR_quality_control > 2` with NaN before doing the resample. I think you can do this with each variable on the same DataFrame (without `drop=True`), then still just do the resample.
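Something like this sketch (variable names follow the snippet above; the loop over the velocity components is an assumption):

```python
# Hypothetical sketch: set bad velocity values to NaN instead of dropping rows,
# so all variables keep the same time index for the resample.
for var in ('UCUR', 'VCUR', 'WCUR'):
    qc_var = var + '_quality_control'
    if var in nc_cell and qc_var in nc_cell:
        # values with QC flag > 2 become NaN; no drop=True, so the index is unchanged
        nc_cell[var] = nc_cell[var].where(nc_cell[qc_var] <= 2)
```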
```python
    ## back the index 30min
    nc_cell.index = nc_cell.index - pd.Timedelta(30, unit='m')

    nc_cell_1H = nc_cell.resample('1H')
```
I think this might be the reason for some of the slowness: we are repeating this resample for every cell, even though the timestamps are the same for each cell! Not sure if xarray allows you to get around that?
Actually, the same is true for excluding the "bad" DEPTH values - this should be done only once.
I would not expect this to be the bottleneck. Anyway, without running locally, I don't know why there is the conversion to a DataFrame above and then resampling - xarray can use the pandas resampling method directly.
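For example, a sketch of resampling directly in xarray (assuming the time dimension is named `TIME`, as in these files; mean is just one possible binning function):

```python
# Hypothetical sketch: resample the Dataset in xarray without converting
# each cell to a pandas DataFrame first.
nc_cell_1H = nc_cell.resample(TIME='1H').mean()
```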
```python
nc_cell['DEPTH'] = nc_cell['DEPTH'] - cell
slice_end = get_resampled_values(nc_cell, ds, slice_start, varlist, binning_fun,
                                 epoch, one_day, is_WCUR)
CELL_INDEX[slice_start:slice_end] = np.full(slice_end - slice_start, cell_idx, dtype=np.uint32)
```
Note that this ordering of values along the `OBSERVATION` dimension is different from what we did in the aggregated product. In that case we cycled through all the cells at a timestamp before moving on to the next timestamp. I can see why this would be a bit more difficult here, but in any case it might be helpful to keep the two products consistent in this sense.
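A toy illustration of the two orderings (hypothetical, not code from either product):

```python
# Hypothetical toy example: 2 timestamps x 3 cells along OBSERVATION.
times = [0, 1]
cells = ['cell0', 'cell1', 'cell2']

# Aggregated product: all cells at one timestamp, then the next timestamp.
time_major = [(t, c) for t in times for c in cells]

# This PR: all timestamps for one cell, then the next cell.
cell_major = [(t, c) for c in cells for t in times]
```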
Co-Authored-By: Marty Hidas <[email protected]>
A few more comments, but otherwise I think it's good for now. Let's see how the Tech Team review goes.
Main issues yet to be resolved:
- I think we should be shifting the input timestamps according to the `seconds_to_middle` attribute, when present;
- Ordering of time and depth/cell along the `OBSERVATION` dimension is different from the aggregated product.
```python
    ## NOTE: There is a possibility of having NaNs in DEPTH after the binning;
    ## this is the warning when calculating the min/max DEPTH.
    ## Maybe I should clean the dataset before closing it.
```
Is this comment still relevant? Doesn't using `np.nanmin()` get around this problem?
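For reference, `np.nanmin()` simply ignores NaNs, whereas the plain `np.min()` propagates them:

```python
import numpy as np

depth = np.array([10.0, np.nan, 12.0])
np.min(depth)     # -> nan (NaN propagates through the plain min)
np.nanmin(depth)  # -> 10.0 (NaNs are ignored)
```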
- All files to be aggregated are from the same site, and have the same `site_code` attribute;
- Variables to be aggregated have `TIME` and (optionally) `HEIGHT_ABOVE_SENSOR` as their only dimensions (or if `LATITUDE` and `LONGITUDE` are included as dimensions, they have size 1);
- The in-water data are bounded by the global attributes `time_deployment_start` and `time_deployment_end`;
Might need to add this, if it is so decided: The `TIME` variable has an attribute `seconds_to_middle_of_measurement` to indicate the offset from each recorded timestamp to the centre of the averaging period.
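If adopted, applying that offset might look roughly like this (a sketch only; the attribute lookup and in-place reassignment are assumptions, not the PR's code):

```python
import numpy as np

# Hypothetical sketch: centre each timestamp on its averaging period using
# the seconds_to_middle_of_measurement attribute, when present.
offset = ds.TIME.attrs.get('seconds_to_middle_of_measurement')
if offset is not None:
    ds['TIME'] = ds.TIME + np.timedelta64(int(offset), 's')
```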
Added a couple of suggestions in the global attributes, to warn users that the timestamps have not been centred before binning.
(also update description of CELL_INDEX and SECONDS_TO_MIDDLE)
* make cell_methods CF compliant
* more accurate description in long_name
* update author_email to AODN one
* reorder imports
* global variable for max QC flag to include
* minor doc string edits
* clearer variable renames
(add comments about not centering the timestamps before binning)
Note the Travis tests are failing due to https://github.com/aodn/issues/issues/692, but a fix is almost ready.
```diff
@@ -41,14 +39,14 @@ def cell_velocity_resample(df, binning_function, is_WCUR):
     return UCUR, VCUR, WCUR, DEPTH


-def get_resampled_values(nc_cell, ds, slice_start, varlist, binning_fun, epoch, one_day, is_WCUR):
+def get_resampled_values(nc_cell, ds, slice_start, varlist, binning_function, epoch, one_day, is_WCUR):
     """
```
too many arguments
Agree. I have removed `varlist` and the last three.
```diff
@@ -73,7 +71,7 @@ def get_resampled_values(nc_cell, ds, slice_start, varlist, binning_fun, epoch,
     ds['WCUR'][slice_start:slice_end], \
```
mutation of input arguments in the function
I would rename the function to highlight that I'm mutating `ds`.
Renamed to `append_resampled_values`, and I've made this clear in the docstring too.
```diff
@@ -54,14 +54,14 @@ def get_resampled_values(nc_cell, ds, slice_start, varlist, binning_function, ep
     """
     df_cell = nc_cell[varlist].squeeze().to_dataframe()
     ## back the index 30min
-    df_cell.index = df_cell.index - pd.Timedelta(minutes=30)
+    df_cell.index = df_cell.index + pd.Timedelta(minutes=30)
     # TODO: shift timestamps to centre of sampling interval
```
Comment not updated with code - you are advancing the index instead.
I imagine the code was wrong before, so the commit could highlight that.
It wasn't wrong, it just took two steps instead of one: it subtracted 30min, resampled, then added an hour.
So I simplified it to just adding 30min, which is exactly equivalent. I did a test comparison, and all the variables in files generated with the two versions of the code are exactly the same.
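A quick self-contained check of that equivalence (a hypothetical demo, not the actual test that was run):

```python
import numpy as np
import pandas as pd

# Hypothetical demo: shifting back 30min, resampling, then adding an hour
# to the bin labels gives the same result as shifting forward 30min and resampling.
times = pd.date_range('2019-01-01 00:10', periods=6, freq='20min')
df = pd.DataFrame({'UCUR': np.arange(6.0)}, index=times)

# Old approach: two steps.
old = df.copy()
old.index = old.index - pd.Timedelta(minutes=30)
old = old.resample('1H').mean()
old.index = old.index + pd.Timedelta(hours=1)

# New approach: one step.
new = df.copy()
new.index = new.index + pd.Timedelta(minutes=30)
new = new.resample('1H').mean()

assert old.equals(new)
```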
Comment changed to "shift the index forward 30min to centre the bins on the hour".
@mhidas - I'm leaving the merge to you. Just a single thing: there's a change to the time resampling (forward shifting instead of backward) that I assume was further checked?
Thanks @ocehugo. I will update that commit, fix the merge conflicts (due to version numbers changing after the other PR was merged) and merge.
This code aggregates the flattened velocity variables, binned into one-hour time intervals. It checks for the presence of `seconds_to_middle_of_measurement` and shifts the timestamps forward accordingly.
Here is a sample file aggregating 20 deployments from NRSDAR.