
Velocity HOURLY #99


Merged
merged 40 commits into master from velocity_hourly on Apr 2, 2020

Conversation


@diodon diodon commented Feb 20, 2020

This code aggregates the flattened velocity variables, binning them into one-hour time intervals. It checks for the presence of seconds_to_middle_of_measurement and shifts the timestamp forward accordingly.
Here is a sample file aggregating 20 deployments from NRSDAR
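
For illustration, a minimal sketch of the binning idea (the function name, input DataFrame and optional offset argument are assumptions, not the actual aggregator code):

```python
import pandas as pd

def bin_hourly(df, seconds_to_middle=None):
    """Bin a flattened velocity DataFrame (indexed by TIME) into one-hour means."""
    df = df.copy()
    if seconds_to_middle is not None:
        # shift each timestamp forward to the middle of its measurement interval
        df.index = df.index + pd.Timedelta(seconds=seconds_to_middle)
    return df.resample('1H').mean()
```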

@diodon diodon requested a review from mhidas February 20, 2020 06:33
@diodon diodon self-assigned this Feb 20, 2020

codecov bot commented Feb 20, 2020

Codecov Report

Merging #99 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master      #99   +/-   ##
=======================================
  Coverage   54.78%   54.78%           
=======================================
  Files           9        9           
  Lines        1150     1150           
  Branches      167      167           
=======================================
  Hits          630      630           
  Misses        493      493           
  Partials       27       27           



diodon commented Mar 5, 2020

The code does not shift the TIME according to seconds_to_middle. However, the portions of the code that do this are commented out in case we want to go back to this approach.


@mhidas mhidas left a comment


I haven't quite got to the end of the code yet, but I think I've managed to get my head around the guts of it. There are a few things to think about.

:param is_WCUR: flag indicating if WCUR is present
:return: end index of the slice
"""
nc_cell = nc_cell.where(nc_cell.DEPTH_quality_control < 4, drop=True)
Contributor

Why limit to < 4 and not < 3 ? With the non-velocity product we included only flags 1 & 2, so let's be consistent here.
Also, this limit would ideally be specified as an argument to the function, or at least a constant within this module.
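
For illustration, that suggestion could look roughly like this (the constant name and threshold value are assumptions):

```python
# module-level constant: keep only "good" and "probably good" data (flags 1 & 2)
MAX_QC_FLAG = 2

def good_depth(nc_cell, max_flag=MAX_QC_FLAG):
    """Drop observations whose DEPTH failed QC (flag > max_flag)."""
    return nc_cell.where(nc_cell.DEPTH_quality_control <= max_flag, drop=True)
```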

Contributor

More importantly, we also need to mask out bad values of the velocity components!!
E.g. Replace UCUR values where UCUR_quality_control > 2 with nan before doing the resample. I think you can do this with each variable on the same DataFrame (without drop=True), then still just do the resample.
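
A rough sketch of that masking step (the variable list and flag threshold are assumed):

```python
def mask_bad_velocity(nc_cell, varlist=('UCUR', 'VCUR', 'WCUR'), max_flag=2):
    """Replace velocity values whose QC flag exceeds max_flag with NaN,
    keeping the timestamps so the subsequent resample still works."""
    for var in varlist:
        qc = var + '_quality_control'
        if var in nc_cell and qc in nc_cell:
            # .where() without drop=True keeps the shape and inserts NaN
            nc_cell[var] = nc_cell[var].where(nc_cell[qc] <= max_flag)
    return nc_cell
```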

## back the index 30min
nc_cell.index = nc_cell.index - pd.Timedelta(30, units='m')

nc_cell_1H = nc_cell.resample('1H')
Contributor

I think this might be the reason for some of the slowness.. we are repeating this resample for every cell, even though the timestamps are the same for each cell! Not sure if xarray allows you to get around that?
Actually, the same is true for excluding the "bad" DEPTH values - this should be done only once.

Contributor

I would not expect this to be the bottleneck. Anyway, without running locally, I don't know why there is the conversion to data_frame above and then resampling - xarray can use the pandas resampling method.
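
As a sketch, resampling directly in xarray would look something like this (the input file name is a placeholder; TIME is assumed to be the record dimension):

```python
import xarray as xr

ds = xr.open_dataset('velocity_deployment.nc')  # placeholder file name

# mask bad QC values first, then resample the whole dataset in one pass,
# rather than converting each cell to a DataFrame and resampling separately
ds_hourly = ds.resample(TIME='1H').mean(skipna=True)
```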

nc_cell['DEPTH'] = nc_cell['DEPTH'] - cell
slice_end = get_resampled_values(nc_cell, ds, slice_start, varlist, binning_fun,
                                 epoch, one_day, is_WCUR)
CELL_INDEX[slice_start:slice_end] = np.full(slice_end - slice_start, cell_idx, dtype=np.uint32)
Contributor

Note that this ordering of values along the OBSERVATION dimension is different from what we did in the aggregated product. In that case we cycled through all the cells at a timestamp before moving on to the next timestamp. I can see why this would be a bit more difficult here, but in any case it might be helpful to keep the two products consistent in this sense.
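
For illustration only, a toy sketch of the two orderings along the OBSERVATION dimension (not the product code):

```python
from itertools import product

timestamps = ['t0', 't1']
cells = ['c0', 'c1', 'c2']

# aggregated product: all cells at one timestamp, then move to the next timestamp
time_major = list(product(timestamps, cells))                 # (t0,c0) (t0,c1) (t0,c2) (t1,c0) ...

# this PR: all timestamps for one cell, then move to the next cell
cell_major = [(t, c) for c, t in product(cells, timestamps)]  # (t0,c0) (t1,c0) (t0,c1) ...
```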


@mhidas mhidas left a comment


A few more comments, but otherwise I think it's good for now. Let's see how the Tech Team review goes.

Main issues yet to be resolved:

  • I think we should be shifting the input timestamps according to the seconds_to_middle attribute, when present;
  • Ordering of time and depth/cell along the OBSERVATION dimension is different from the aggregated product.


## NOTE: There is a possibility of having NaNs in DEPTH after the binning
## this is the warning when calculating the min/max DEPTH
## maybe I should clean the dataset before close it
Contributor

Is this comment still relevant? Doesn't using np.nanmin() get around this problem?
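
For reference, np.nanmin ignores NaNs where np.min would propagate them:

```python
import numpy as np

depth = np.array([10.0, np.nan, 12.5])
print(np.min(depth))     # nan  (the NaN propagates)
print(np.nanmin(depth))  # 10.0 (NaN ignored; only an all-NaN input raises a RuntimeWarning)
```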

- All files to be aggregated are from the same site, and have the same `site_code` attribute;
- Variables to be aggregated have `TIME` and (optionally) `HEIGHT_ABOVE_SENSOR` as their only dimensions (or if `LATITUDE` and `LONGITUDE` are included as dimensions, they have size 1);
- The in-water data are bounded by the global attributes `time_deployment_start` and `time_deployment_end`;

Contributor

Might need to add this if it is so decided:
The TIME variable has an attribute seconds_to_middle_of_measurement to indicate the offset from each recorded timestamp to the centre of the averaging period.
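
If that behaviour were adopted, applying the offset on read could look roughly like this (the file name is a placeholder; only the attribute name comes from the comment above):

```python
import numpy as np
import xarray as xr

ds = xr.open_dataset('deployment_file.nc')  # placeholder file name
offset = ds['TIME'].attrs.get('seconds_to_middle_of_measurement')
if offset is not None:
    # centre each timestamp on its averaging period before binning
    ds = ds.assign_coords(TIME=ds['TIME'].values + np.timedelta64(int(offset), 's'))
```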


@mhidas mhidas left a comment


Added a couple of suggestions in the global attributes, to warn users that the timestamps have not been centred before binning.

mhidas added 9 commits March 31, 2020 17:07
(also update description of CELL_INDEX and SECONDS_TO_MIDDLE)
* make cell_methods CF compliant
* more accurate description in long_name
* update author_email to AODN one
* reorder imports
* global variable for max QC flag to include
* minor doc string edits
* clearer variable renames
(add comments about not centering the timestamps before binning)
@mhidas mhidas requested a review from ocehugo April 1, 2020 02:00

mhidas commented Apr 1, 2020

@ocehugo If you have a bit of time today, please take a quick look at this & merge unless you find anything really broken. You only really need to look at the last 9 commits (starting with bd151bc), as the rest were done by @diodon and reviewed by me.


mhidas commented Apr 1, 2020

Note the Travis tests are failing due to https://github.com/aodn/issues/issues/692, but a fix is almost ready.

@@ -41,14 +39,14 @@ def cell_velocity_resample(df, binning_function, is_WCUR):
return UCUR, VCUR, WCUR, DEPTH


-def get_resampled_values(nc_cell, ds, slice_start, varlist, binning_fun, epoch, one_day, is_WCUR):
+def get_resampled_values(nc_cell, ds, slice_start, varlist, binning_function, epoch, one_day, is_WCUR):
"""
Contributor

too many arguments

Contributor

Agree. I have removed varlist and the last three.

@@ -73,7 +71,7 @@ def get_resampled_values(nc_cell, ds, slice_start, varlist, binning_fun, epoch,
ds['WCUR'][slice_start:slice_end], \
Contributor

mutation of input arguments in the function

Contributor

I would rename the function to highlight that I'm mutating ds.

Contributor

Renamed to append_resampled_values, and I've made this clear in the docstring too

@@ -54,14 +54,14 @@ def get_resampled_values(nc_cell, ds, slice_start, varlist, binning_function, ep
"""
df_cell = nc_cell[varlist].squeeze().to_dataframe()
## back the index 30min
-df_cell.index = df_cell.index - pd.Timedelta(minutes=30)
+df_cell.index = df_cell.index + pd.Timedelta(minutes=30)
# TODO: shift timestamps to centre of sampling interval
Contributor

Comment not updated with code - you are advancing the index instead.

Contributor

I imagine the code was wrong before, so the commit could highlight that

Contributor

It wasn't wrong, it just took two steps instead of one: it subtracted 30min, resampled, then added an hour.
So I simplified it to just adding 30min, which is exactly equivalent. I did a test comparison and all the variables in files generated with the two versions of the code are exactly the same.
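
A quick toy check of that equivalence (synthetic data, not the actual test comparison):

```python
import numpy as np
import pandas as pd

times = pd.date_range('2020-01-01 00:10', periods=12, freq='10min')
df = pd.DataFrame({'UCUR': np.arange(12.0)}, index=times)

# old approach: shift back 30 min, resample, then advance the bin labels one hour
old = df.copy()
old.index = old.index - pd.Timedelta(minutes=30)
old = old.resample('1H').mean()
old.index = old.index + pd.Timedelta(hours=1)

# new approach: shift forward 30 min, then resample
new = df.copy()
new.index = new.index + pd.Timedelta(minutes=30)
new = new.resample('1H').mean()

assert old.equals(new)  # both centre the hourly bins on the hour
```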

Contributor

comment changed to "shift the index forward 30min to centre the bins on the hour"


ocehugo commented Apr 1, 2020

@mhidas - I'm leaving the merge to you, with just one thing to confirm - there is a change in the time resampling (forward shifting instead of backward) that I assume was checked further?


mhidas commented Apr 2, 2020

Thanks @ocehugo. I will update that commit, fix the merge conflicts (due to version numbers changing after the other PR was merged) and merge.

@mhidas mhidas merged commit 63abc6d into master Apr 2, 2020
@mhidas mhidas deleted the velocity_hourly branch April 2, 2020 06:32