The execution report cannot handle big workflows #547
Comments
Hah, this is a familiar problem.. :) Can perhaps drop the table and generate the plots at run time to embed in the HTML? |
That could be an option, but it would still have to parse the big JSON payload to compute the values displayed in the chart, wouldn't it? Maybe we could compute those values server side. Does that make sense? |
Yes exactly, that's what I meant by run time. So in MultiQC I have a max number of samples and after that I generate flat image plots, which scale to infinity without increasing file size. The Plotly library that we're using for the report has built-in options for exporting to flat images like this, at least in the Python version: https://plot.ly/python/static-image-export/ Phil |
OK, but wouldn't it be even easier to move this code to the NF side and leave the rendering on the client?
|
Yes, if the raw plot data isn't too big I guess. If we have tens or hundreds of thousands of tasks with four numbers each, that could still be quite a bit of data and browser processor load. But yes, maybe let's try that first! Phil |
Wait, basically the above code creates a separate series for each process name, but it still needs to hold in memory and parse all tasks. I thought it was calculating the min, max, median, etc. values. Is it not possible to give Plotly just the final values to render? |
Apparently not 😞 plotly/plotly.js#1059 |
Too bad. This only leaves the NF side image rendering, but I can't find a Java client for Plotly. Maybe it's possible to use the JS one with the Java embedded JavaScript engine. |
😰 It might be worth trying with just the numbers first. I imagine that 99% of the filesize will be the other JSON stuff. |
Not so sure. I will give it a try to see if it's possible to render a chart with Nashorn. |
Brilliant! This is totally cheating but why not?? |
Good. That being so, if I implement the computation of [min, q1, median, q3, max] for each process, would you be able to render it? Does that sound like a good plan? |
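To make the proposal concrete, here is a minimal sketch of the per-process [min, q1, median, q3, max] computation. The thread plans to do this on the Nextflow side, so the JavaScript below is only an illustration of the logic. The task field names (`process`, `realtime`) are hypothetical, and the quartiles use linear interpolation between closest ranks, which is one of several common definitions and is an assumption here.

```javascript
// Quantile by linear interpolation between closest ranks.
// Assumption: this is just one common quartile definition.
function quantile(sorted, p) {
  const pos = (sorted.length - 1) * p;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Returns [min, q1, median, q3, max], or null for an empty series.
function fiveNumberSummary(values) {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  return [
    sorted[0],
    quantile(sorted, 0.25),
    quantile(sorted, 0.5),
    quantile(sorted, 0.75),
    sorted[sorted.length - 1],
  ];
}

// Groups tasks by process name and summarizes one metric per group.
// `task.process` and the metric key are hypothetical field names.
function summarizeByProcess(tasks, metric) {
  const groups = new Map();
  for (const task of tasks) {
    const value = task[metric];
    // Skip tasks where the metric is missing or not numeric.
    if (typeof value !== "number" || Number.isNaN(value)) continue;
    if (!groups.has(task.process)) groups.set(task.process, []);
    groups.get(task.process).push(value);
  }
  return [...groups].map(([name, values]) => [name, ...fiveNumberSummary(values)]);
}
```

Only the five numbers per process need to reach the report, so the output size depends on the number of processes rather than the number of tasks.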
Yes, sounds great 👍 I hesitated about what the whiskers show - the min / max or some other metric. Plotly has some documentation on this here: https://help.plot.ly/what-is-a-box-plot/#the-whiskers - I guess the current plots show the min / max? May be worth double checking though to be sure that we're doing the same thing as normal reports. |
OK, I will check that. |
So, I've double-checked this and the whiskers show the min and the max. How would you need the JSON data structured to render this information? |
Great! Doesn't matter too much what the JSON looks like, I can work around it. At the moment we have But whatever is easiest for you really - as long as the task JSON isn't printed to the HTML (to avoid the filesize) and the new summaries are then we should be able to adapt the javascript accordingly. Phil |
OK. Let's try to speed this up. Could you provide the javascript snippet to render those charts using a fake JSON structure? Then I can generate it dynamically on the NF side. |
Is it not nicer to just print the data into the report as JSON and then use the embedded javascript file that we already have for the javascript code? Nothing fancy required. Off the top of my head: {
"process_summary": {
"time": [
[ "fastqc", 12, 24, 35, 46, 90 ],
[ "bwa", 120, 230, 340, 450, 560 ]
],
"cpu": [
[ "fastqc", 20, 50, 76, 100, 140 ]
]
}
} Where the arrays are process name, min, lower quartile, median, upper quartile, max. Or whatever. Once we have this as JSON in the report (the above printed in the same way as the existing data) I can plug it into the existing plot code pretty easily. Phil |
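As a rough sketch of how the client side might consume such a structure, the function below unpacks each `[name, min, q1, median, q3, max]` row into a named object. The output keys (`lowerfence`, `q1`, `median`, `q3`, `upperfence`) are an assumption: they mirror attributes that later Plotly.js releases accept for box traces with precomputed statistics, whereas the issue linked earlier in this thread suggests the version in use at the time did not support this. `toBoxTraces` is a hypothetical helper name.

```javascript
// Example data copied from the structure proposed in the thread.
const processSummary = {
  time: [
    ["fastqc", 12, 24, 35, 46, 90],
    ["bwa", 120, 230, 340, 450, 560],
  ],
  cpu: [["fastqc", 20, 50, 76, 100, 140]],
};

// Hypothetical helper: one trace-like object per process row.
// Assumption: the plotting library accepts precomputed box statistics
// under these keys; adapt the keys to whatever the renderer expects.
function toBoxTraces(rows) {
  return rows.map(([name, min, q1, median, q3, max]) => ({
    type: "box",
    name,
    lowerfence: [min], // whiskers show min / max, as confirmed in the thread
    q1: [q1],
    median: [median],
    q3: [q3],
    upperfence: [max],
  }));
}

const timeTraces = toBoxTraces(processSummary.time);
```

With a renderer that supports this, the browser never needs the per-task values, only one small object per process.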
What about a structure like this:
|
Perfect! Labels are maybe not really needed, but doesn't hurt. |
Good. Let me add a few notes:
|
Just pushed this change. Now the JSON payload includes a new summary data object structured as shown in the example in the previous comment. |
I've fixed a couple of issues and added logging of the summary JSON to the nextflow log. One more note: there could be a corner case in which a series eg.
|
Any progress here? |
No sorry, haven't touched a keyboard for coding for a couple of weeks now. Next two weeks look packed too, will hopefully have some time after that (or will treat it as an evening / airplane project). See you Monday! Maybe we can have a mini-hackathon during a coffee break ;) |
It sounds good! |
Ah, you guys are heroes! Sorry for being so slow with this - I had a stab the other day at the airport back from Barcelona but couldn't get the testing environment running again for some reason so moved on to some of the MultiQC backlog instead. |
This commit renders the execution report boxplots using the report summary data collected by nextflow at runtime. The trace data is omitted when the number of tasks is greater than 10’000 (default) and the related table is not shown
OK. We refactored the report a bit to allow rendering a large number of tasks. The main changes are the following:
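The threshold behaviour described in this commit message can be sketched roughly as follows. This is an illustration only, not the actual Nextflow implementation: `buildReportPayload`, `DEFAULT_MAX_TASKS` and the payload keys are hypothetical names.

```javascript
// Default threshold taken from the commit message above.
const DEFAULT_MAX_TASKS = 10000;

// Hypothetical helper: always embed the small summary, but embed the
// per-task trace (and show its table) only up to the threshold.
function buildReportPayload(tasks, summary, maxTasks = DEFAULT_MAX_TASKS) {
  // Trace is omitted when the task count is greater than maxTasks.
  const includeTrace = tasks.length <= maxTasks;
  return {
    summary,
    trace: includeTrace ? tasks : null,
    showTable: includeTrace,
  };
}
```

Since the boxplots read only from `summary`, they still render when `trace` is omitted.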
Please review it at your earliest convenience. |
Sorry, one more thing: we now have the task name for each metric, ie. However, I was unable to show it over the boxplot as suggested in #504. Any idea whether Plotly allows that? |
Currently all tasks metadata and metrics are embedded as a JSON object in the execution report HTML file.
For big workflow executions, i.e. those spawning 50K tasks or more, the resulting file becomes too big (>100MB) and browsers are not able to handle it.
The goal of this feature is to implement a better strategy to handle a large number of tasks.
/cc @ewels we need to brainstorm a bit about this at some point.