Replies: 4 comments 4 replies
-
This is a good analysis. Let me dive into it a bit.
-
Thanks for compiling this. Looking at the next steps, are we intending to analyze the impact of warmup iterations or high iteration counts? It would be very interesting to see the data on runtimes per iteration over, say, 1000 iterations run back to back. Does it stabilize? Or spike randomly? Can we rely on the median latency staying stable over a large number of iterations?
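For illustration, a minimal sketch of what such a per-iteration check could look like, assuming the raw per-iteration runtimes are available as an array; the function name and placeholder data below are hypothetical and not part of the existing tooling:

```python
import numpy as np
import pandas as pd

def rolling_cv(latencies_ms, window=50):
    """Rolling coefficient of variation (%) over consecutive iterations."""
    s = pd.Series(latencies_ms)
    roll = s.rolling(window)
    return (roll.std() / roll.mean() * 100).dropna()

# Placeholder data standing in for 1000 back-to-back iteration runtimes.
latencies_ms = np.random.lognormal(mean=3.0, sigma=0.05, size=1000)
cv_trace = rolling_cv(latencies_ms)
print(f"rolling CV: first window {cv_trace.iloc[0]:.2f}%, last window {cv_trace.iloc[-1]:.2f}%")
```

A flat or decreasing trace would suggest the run stabilizes after warmup; recurring spikes would show up as jumps in the rolling CV.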
-
Amazing how we can do such analysis in OSS. ❤️ A couple of random thoughts while reading the post:
-
FYI, now that more data from public Android devices has been found, I just updated the post to incorporate the private vs. public comparison. The metrics from the new data strengthen the conclusions, indicating that using private AWS devices can provide decent stability for Android benchmarking. cc: @cbilgin @kimishpatel @digantdesai
-
Benchmark Infra Stability Assessment with private AWS devices
TL;DR
Here are the highlights and lowlights of what the plain data shows.
My conclusions and recommendations based on the data are shared at the end.
Understanding Stability Metrics
To properly assess the stability of ML model inference latency, I use several key statistical metrics:
A composite stability score (0-100 scale) is then calculated from a weighted combination of the CV, Max/Min ratio, and P99/P50 ratio.
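As a rough illustration (not the actual script linked in the References), here is how these metrics could be computed for one dataset. The per-metric normalization and the weights in the composite score are assumptions for the sketch; the post only states that CV, Max/Min ratio, and P99/P50 ratio are weighted together.

```python
import numpy as np

def stability_metrics(latencies_ms):
    """CV (%), Max/Min ratio, and P99/P50 ratio for one latency dataset."""
    lat = np.asarray(latencies_ms, dtype=float)
    p50, p99 = np.percentile(lat, [50, 99])
    return {
        "cv_pct": lat.std(ddof=1) / lat.mean() * 100,
        "max_min_ratio": lat.max() / lat.min(),
        "p99_p50_ratio": p99 / p50,
    }

def composite_score(metrics, weights=(0.5, 0.25, 0.25)):
    """Combine the three metrics into a 0-100 stability score.
    The normalization constants and weights here are illustrative guesses."""
    cv_score = max(0.0, 100.0 - 2.0 * metrics["cv_pct"])
    mm_score = max(0.0, 100.0 - 50.0 * (metrics["max_min_ratio"] - 1.0))
    tail_score = max(0.0, 100.0 - 100.0 * (metrics["p99_p50_ratio"] - 1.0))
    w_cv, w_mm, w_tail = weights
    return w_cv * cv_score + w_mm * mm_score + w_tail * tail_score
```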
Intra-primary (private) Dataset Stability Comparison
I will begin the analysis by examining the key metrics for the primary (private) dataset. This section focuses on assessing the inherent stability of our benchmarking environment before making any comparison to public infrastructure. By analyzing the key statistical metrics above across different model and device combinations, we can establish a baseline understanding of performance consistency and stability.
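A minimal sketch of how the per-combination summary below could be assembled, reusing `stability_metrics()` and `composite_score()` from the earlier sketch. The column names ("model", "device", "latency_ms") are assumptions, not the actual spreadsheet/dashboard schema.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Per-(model, device) stability summary from raw latency rows."""
    rows = []
    for (model, device), group in df.groupby(["model", "device"]):
        m = stability_metrics(group["latency_ms"])
        m["stability_score"] = composite_score(m)
        rows.append({"model": model, "device": device, **m})
    return pd.DataFrame(rows).sort_values("stability_score", ascending=False)
```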
Overall Stability Summary:
Device-based Comparison:
My insights and recommendations
The analysis of latency stability across private AWS devices reveals certain patterns in performance consistency:
Intra-private analysis reveals that S22 with Android 13 and rooted Pixel 3 devices provide acceptable stability across all tested delegates (QNN and XNNPACK) and models (Llama3.2-1b and MobileNetV3), demonstrating that our private AWS infrastructure can deliver consistent benchmarking results.
Inter-dataset (private & public) Stability Comparison
To assess whether private AWS devices provide better stability than their public counterparts, I conducted a detailed comparison between matching datasets from both environments. This section presents an apples-to-apples comparison of benchmark stability for identical model-device combinations, allowing us to directly evaluate the benefits of moving to private infrastructure.
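As a sketch of how each pairwise comparison below could be produced (again reusing `stability_metrics()` from earlier; the inputs are assumed to be the raw latency samples for the same model+config+device from each fleet):

```python
def compare_fleets(private_lat, public_lat):
    """Side-by-side metrics for the same model+config+device on both fleets,
    plus the private-vs-public delta in percent."""
    priv, pub = stability_metrics(private_lat), stability_metrics(public_lat)
    return {
        name: {
            "private": priv[name],
            "public": pub[name],
            "delta_pct": (priv[name] - pub[name]) / pub[name] * 100.0,
        }
        for name in priv
    }
```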
1. llama3_qlora+s22_android13 (Private) vs. llama3_qlora+s22_android13 (Public)
Metrics Comparison:
Interpretation:
2. llama3_spinq+s22_android13 (Private) vs llama3_spinq+s22_android13 (Public)
Metrics Comparison:
Interpretation:
3. mv3_xnnq8+s22_android13 (Private) vs. mv3_xnnq8+s22_android13 (Public)
Metrics Comparison:
Interpretation:
Primary (Private) Datasets Summary:
Reference (Public) Datasets Summary:
My insights and recommendations
Detailed Stability Analysis on Individual Dataset - Primary (Private)
The full set of individual dataset analyses can be downloaded here. In this section I highlight detailed statistical metrics for only a few selected datasets.
1. Latency Stability Analysis: llama3_spinq+s22_android13 (Primary)
2. Latency Stability Analysis: mv3_qnn+s22_android13 (Primary)
3. Latency Stability Analysis: mv3_xnnq8+s22_android13 (Primary)
4. Latency Stability Analysis: llama3_spinq+iphone15max_ios17 (Primary)
Summary of Conclusions and Next Steps
Android Benchmarking
The analysis shows that private AWS devices provide significantly better stability for Android benchmarking. The data supports a specific configuration strategy:
As next steps, we should:
iOS Benchmarking
Both private and public iOS devices show poor stability across all models (CV values of 21-37%), indicating fundamental flaws in our iOS benchmarking methodology/app. Until methodology improvements have been validated with new benchmark data, we risk not being able to reach a meaningful conclusion on moving iOS benchmarking forward by the end of June.
DevX Improvements
Our current benchmarking infrastructure has critical gaps that limit our ability to understand and address stability issues. These limitations are particularly problematic when trying to diagnose the root causes of performance variations we've observed across devices.
Current Gaps
Addressing these gaps is urgent for establishing a reliable benchmarking infrastructure. Without these improvements, we risk being unable to make timely decisions and may base conclusions on misleading or incomplete data.
References
Here I have attached the data sources and my script in case anyone wants to repeat the work. Please also use them as a reference when filling the infra gaps above.
The script used for analysis
Data source:
Datasets from Primary/Private AWS devices:
Benchmark Dataset with Private AWS Devices.xlsx
Datasets from Reference/Public AWS devices:
Benchmark Dataset with Public AWS Devices.xlsx
Each tab represents one dataset collected for one model+config+device combination. The data is copied from the ExecuTorch benchmark dashboard.