Skip to content

The execution report cannot handle big workflows #547

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
pditommaso opened this issue Dec 7, 2017 · 31 comments
Closed

The execution report cannot handle big workflows #547

pditommaso opened this issue Dec 7, 2017 · 31 comments
Milestone

Comments

@pditommaso
Copy link
Member

Currently all tasks metadata and metrics are embedded as a JSON object in the execution report HTML file.

For big workflow executions ie. spawning 50K tasks or more, the resulting become to big (>100MB) and browser are not able to handle it.

The goal of this feature is to implement a better strategy to handle a large number of tasks.

/cc @ewels we need to brainstorm a bit about this at some point.

@pditommaso pditommaso added this to the v0.27.0 milestone Dec 7, 2017
@ewels
Copy link
Member

ewels commented Dec 7, 2017

Hah, this is a familiar problem.. :) Can perhaps drop the table and generate the plots at run time to embed in the HTML?

@pditommaso
Copy link
Member Author

It could be an option, but it should still parse the big JSON payload to compute the values displayed in the chart, shouldn't it? Maybe we could compute those values server side. Does it make sense ?

@ewels
Copy link
Member

ewels commented Dec 7, 2017

Yes exactly, that's what I meant by run time. So in MultiQC I have a max number of samples and after that I generate flat image plots, which scale to infinity without increasing file size. The Plotly library that we're using for the report has built-in options for exporting to flat images like this, at least in the Python version: https://plot.ly/python/static-image-export/

Phil

@pditommaso
Copy link
Member Author

OK, but wouldn't be even easier moving this code NF side and leave the rendering on the client?

      for (var i = 0; i < window.data_byprocess[pname].length; i ++) {
        rc[i] = xround(parseInt(window.data_byprocess[pname][i]['%cpu']));
        pc[i] = xround((parseInt(window.data_byprocess[pname][i]['%cpu']) / (parseInt(window.data_byprocess[pname][i]['cpus']) * 100)) * 100);
        rm[i] = xround(parseInt(window.data_byprocess[pname][i]['vmem']) / 1000000000);
        pm[i] = xround((parseInt(window.data_byprocess[pname][i]['vmem']) / parseInt(window.data_byprocess[pname][i]['memory'])) * 100)
        rt[i] = xround(moment.duration( parseInt(window.data_byprocess[pname][i]['realtime']) ).asMinutes());
        pt[i] = xround((parseInt(window.data_byprocess[pname][i]['realtime']) / parseInt(window.data_byprocess[pname][i]['time'])) * 100)
        rd[i] = xround((parseInt(window.data_byprocess[pname][i]['read_bytes']) + parseInt(window.data_byprocess[pname][i]['write_bytes'])) / 1000000000);
        if (parseInt(window.data_byprocess[pname][i]['time']) > 0){ time_alloc_hasdata = true; }
        if (rd[i] > 0){ readwrite_hasdata = true; }
      }

@ewels
Copy link
Member

ewels commented Dec 7, 2017

Yes, if the raw plot data isn't too big I guess. If we have 10s or 100s thousand of tasks with four numbers each that could still be quite a bit of data and browser processor load. But yes maybe let's try that first!

Phil

@pditommaso
Copy link
Member Author

Wait, basically the above code creates a separate series for each process name, but it still need to hold in memory and parse all tasks.

I was thinking it was calculating the min,max,median,etc values.

Is it not possible to give Plotly just the final values to render?

@ewels
Copy link
Member

ewels commented Dec 7, 2017

Apparently not 😞 plotly/plotly.js#1059

@pditommaso
Copy link
Member Author

Too bad. This only leave the NF side image rendering, but I can't find the Java client for Plotly. Maybe it's possible to use the JS one with the Java embedded JavaScript engine ..

@ewels
Copy link
Member

ewels commented Dec 7, 2017

😰

It might be worth trying with just the numbers first. I imagine that 99% of the filesize will be the other JSON stuff.

@pditommaso
Copy link
Member Author

Not so sure. I will give a try to see if it's possible to render a chart with Nashorn.

@pditommaso
Copy link
Member Author

Wait, it seems it's possible to render a box plot just providing the Y data. Look for an example here. I guess this would give us much more control, including a way to fix #504.

What do you think?

@ewels
Copy link
Member

ewels commented Dec 11, 2017

Brilliant! This is totally cheating but why not??

@pditommaso
Copy link
Member Author

pditommaso commented Dec 11, 2017

Good. Being so if I implement the computation of [min, q1, median, q3, max] for each process, would you be able to render it? does it sound a good plan?

@ewels
Copy link
Member

ewels commented Dec 11, 2017

Yes, sounds great 👍

I hesitated about what the whiskers show - the min / max or some other metric. Plotly has some documentation on this here: https://help.plot.ly/what-is-a-box-plot/#the-whiskers - I guess the current plots show the min / max? May be worth double checking though to be sure that we're doing the same thing as normal reports.

@pditommaso
Copy link
Member Author

OK, I will check that.

@pditommaso
Copy link
Member Author

So, i've double checked this and the whiskers show the min and the max. How would you need the json data structured to render these info?

@ewels
Copy link
Member

ewels commented Jan 14, 2018

Great! Doesn't matter too much what the JSON looks like, I can work around it. At the moment we have window.data['trace'] that contains all of the stats for each task. Could instead have window.data['process'] that has these summary stats for the processes? Then if the former isn't defined the javascript can use the latter instead.

But whatever is easiest for you really - as long as the task JSON isn't printed to the HTML (to avoid the filesize) and the new summaries are then we should be able to adapt the javascript accordingly.

Phil

@pditommaso
Copy link
Member Author

Ok. Let's try to speedup this. Could you provide the javascript snippet to render that charts using a fake json structure? then I can generate it dynamically on NF side.

@ewels
Copy link
Member

ewels commented Jan 15, 2018

Is it not nicer to just print the data into the report as JSON and then use the embedded javascript file that we already have for the javascript code?

Nothing fancy required. From the top of my head:

{
    "process_summary": {
        "time": [
            [ "fastqc", 12, 24, 35, 46, 90 ],
            [ "bwa", 120, 230, 340, 450, 560 ],
        ],
        "cpu": [
            [ "fastqc", 20, 50, 76, 100, 140 ],
        ],
    }
}

Where the arrays are process name, min, lower quartile, median, upper quartile, max. Or whatever.

Once we have this in as JSON in the report (the above printed in the same way as the existing data) I can plug it into the existing plot code pretty easily.

Phil

@pditommaso
Copy link
Member Author

What about a structure like this:

{
   "bwa-mem":{
      "cpu":{
         "min":1000.0,
         "minLabel":"bwa-mem-1",
         "max":11000.0,
         "maxLabel":"bwa-mem-3",
         "q1":3500.0,
         "q2":6000.0,
         "q3":8500.0,
         "mean":6000.0
      },
      "mem":{
         "min":2000.0,
         "minLabel":"bwa-mem-1",
         "max":12000.0,
         "maxLabel":"bwa-mem-3",
         "q1":4500.0,
         "q2":7000.0,
         "q3":9500.0,
         "mean":7000.0
      },
      "time":{
         "min":3000.0,
         "minLabel":"bwa-mem-1",
         "max":13000.0,
         "maxLabel":"bwa-mem-3",
         "q1":5500.0,
         "q2":8000.0,
         "q3":10500.0,
         "mean":8000.0
      },
      "reads":{
         "min":4000.0,
         "minLabel":"bwa-mem-1",
         "max":14000.0,
         "maxLabel":"bwa-mem-3",
         "q1":6500.0,
         "q2":9000.0,
         "q3":11500.0,
         "mean":9000.0
      },
      "writes":{
         "min":5000.0,
         "minLabel":"bwa-mem-1",
         "max":15000.0,
         "maxLabel":"bwa-mem-3",
         "q1":7500.0,
         "q2":10000.0,
         "q3":12500.0,
         "mean":10000.0
      }
   },
   "multiqc":{
      "cpu":{
         "min":16000.0,
         "minLabel":"multiqc-1",
         "max":21000.0,
         "maxLabel":"multiqc-2",
         "q1":17250.0,
         "q2":18500.0,
         "q3":19750.0,
         "mean":18500.0
      },
      "mem":{
         "min":17000.0,
         "minLabel":"multiqc-1",
         "max":22000.0,
         "maxLabel":"multiqc-2",
         "q1":18250.0,
         "q2":19500.0,
         "q3":20750.0,
         "mean":19500.0
      },
      "time":{
         "min":18000.0,
         "minLabel":"multiqc-1",
         "max":23000.0,
         "maxLabel":"multiqc-2",
         "q1":19250.0,
         "q2":20500.0,
         "q3":21750.0,
         "mean":20500.0
      },
      "reads":{
         "min":19000.0,
         "minLabel":"multiqc-1",
         "max":24000.0,
         "maxLabel":"multiqc-2",
         "q1":20250.0,
         "q2":21500.0,
         "q3":22750.0,
         "mean":21500.0
      },
      "writes":{
         "min":20000.0,
         "minLabel":"multiqc-1",
         "max":25000.0,
         "maxLabel":"multiqc-2",
         "q1":21250.0,
         "q2":22500.0,
         "q3":23750.0,
         "mean":22500.0
      }
   }
}

@ewels
Copy link
Member

ewels commented Jan 23, 2018

Perfect! Labels are maybe not really needed, but doesn't hurt.

@pditommaso
Copy link
Member Author

Good. Let me add a few notes:

  1. the label are meant to provide an useful information when passing the mouse over the chart Include process tag in the hoverover of HTML reports #504
  2. I've added also the mean which is not reported in the current chart.
  3. I've split the storage data into reads and writes with the idea to render them in two separate charts, because I think the current report is not informative enough on disk usage.

pditommaso added a commit that referenced this issue Jan 24, 2018
This commit adds a new `summary` data object in the JSON 
payload embedded in the HTML execution report. This 
needs to be used to visualise the execution metrics 
charts, instead to compute them on client side which 
can be overkill for workflow executing a large number 
of tasks. 


Issues: #547, #504
@pditommaso
Copy link
Member Author

pditommaso commented Jan 24, 2018

Just pushed this change. Now the JSON payload includes a new summary data object structured as shown in the example in the previous comment.

@pditommaso
Copy link
Member Author

I've fixed a couple of issues and added the logging of the summary JSON in the nextflow log.

A note more, there a could corner case in which a series eg. cpu, mem, etc it's empty because the metric are missing. Currently it's reported as a null. For example:

 {
     "foo": {
         "cpu": null,
         "mem": null,
         "time": {
             "mean": 82.25,
             "min": 54.0,
             "q1": 79.5,
             "q2": 90.0,
             "q3": 92.75,
             "max": 95.0,
             "minLabel": "foo (4)",
             "maxLabel": "foo (1)",
             "q1Label": "foo (4)",
             "q2Label": "foo (3)",
             "q3Label": "foo (2)"
         },
         "reads": null,
         "writes": null
     }
 }

@pditommaso
Copy link
Member Author

Any progress here?

@ewels
Copy link
Member

ewels commented Jan 31, 2018

No sorry, haven't touched a keyboard for coding for a couple of weeks now. Next two weeks look packed too, will hopefully have some time after that (or will treat it as an evening / airplane project).

See you Monday! Maybe we can have a mini-hackathon during a coffee break ;)

@pditommaso
Copy link
Member Author

It sounds good!

@pditommaso
Copy link
Member Author

@ewels just to prevent any duplicated work, with @emi80 we have almost completed this.

@ewels
Copy link
Member

ewels commented Feb 16, 2018

Ah, you guys are heroes! Sorry for being so slow with this - I had a stab the other day at the airport back from Barcelona but couldn't get the testing environment running again for some reason so moved on to some of the MultiQC backlog instead.

pditommaso added a commit that referenced this issue Feb 16, 2018
This commit render the execution report boxplots 
using the report summary data collected by nextflow 
at runtime. 

The trace data is omitted when the number of tasks 
is greater than 10’000 (default) and the related 
table is not shown
@pditommaso
Copy link
Member Author

OK. We refactored a bit the report to allow the rendering of large number of tasks. Main changes are the following:

  1. task table rendering has been optimised to improve performance
  2. trace data is retuned only when it contains < 10'000 tasks (default), for a higher number the tasks table is not shown
  3. boxplots are rendered using the summary data produced by NF
  4. as side effect the distribution is not showed any more (that's really a pity, but i was unable to find a way to render it at least when the trace records are available.

Please review it at your earliest convenience.

@pditommaso
Copy link
Member Author

pditommaso commented Feb 16, 2018

Sorry one more thing, now we have the information of the task name for each metric ie. minLabel, q1Label, etc

However I was unable to show it over the boxplot as it was suggesting #504. Any idea if it's plotty allows that?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants