Replies: 4 comments 4 replies
-
This is a good analysis. Let me dive into it a bit.
-
Thanks for compiling this. Looking at the next steps, are we intending to analyze the impact of warmup iterations or high iteration counts? It would be very interesting to see the data on runtimes per iteration over, say, 1000 iterations run back to back. Does it stabilize? Or spike randomly? Can we rely on the median latency staying stable over a large number of iterations?
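For illustration, a minimal sketch of what such a per-iteration check could look like, assuming the raw per-iteration runtimes are available as an array; the function name and placeholder data below are hypothetical and not part of the existing tooling:

```python
import numpy as np
import pandas as pd

def rolling_cv(latencies_ms, window=50):
    """Rolling coefficient of variation (%) over consecutive iterations."""
    s = pd.Series(latencies_ms)
    roll = s.rolling(window)
    return (roll.std() / roll.mean() * 100).dropna()

# Placeholder data standing in for 1000 back-to-back iteration runtimes.
latencies_ms = np.random.lognormal(mean=3.0, sigma=0.05, size=1000)
cv_trace = rolling_cv(latencies_ms)
print(f"rolling CV: first window {cv_trace.iloc[0]:.2f}%, last window {cv_trace.iloc[-1]:.2f}%")
```

A flat or decreasing trace would suggest the run stabilizes after warmup; recurring spikes would show up as jumps in the rolling CV.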
-
Amazing how we can do such analysis in OSS. ❤️ A couple of random thoughts while reading the post:
-
FYI, now that more data from public Android devices has been found, I just updated the post to incorporate the private vs. public comparison. The metrics from the new data strengthen the conclusions, indicating that using private AWS devices can provide decent stability for Android benchmarking. cc: @cbilgin @kimishpatel @digantdesai
-
Benchmark Infra Stability Assessment with private AWS devices
TL;DR
Here are the highlights and lowlights of what the plain data shows.
My conclusions and recommendations based on the data are shared at the end.
Understanding Stability Metrics
To properly assess the stability of ML model inference latency, I use several key statistical metrics:
A composite stability score (0-100 scale) is then calculated from a weighted combination of the CV, Max/Min ratio, and P99/P50 ratio.
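As a rough illustration (not the actual script linked in the References), here is how these metrics could be computed for one dataset. The per-metric normalization and the weights in the composite score are assumptions for the sketch; the post only states that CV, Max/Min ratio, and P99/P50 ratio are weighted together.

```python
import numpy as np

def stability_metrics(latencies_ms):
    """CV (%), Max/Min ratio, and P99/P50 ratio for one latency dataset."""
    lat = np.asarray(latencies_ms, dtype=float)
    p50, p99 = np.percentile(lat, [50, 99])
    return {
        "cv_pct": lat.std(ddof=1) / lat.mean() * 100,
        "max_min_ratio": lat.max() / lat.min(),
        "p99_p50_ratio": p99 / p50,
    }

def composite_score(metrics, weights=(0.5, 0.25, 0.25)):
    """Combine the three metrics into a 0-100 stability score.
    The normalization constants and weights here are illustrative guesses."""
    cv_score = max(0.0, 100.0 - 2.0 * metrics["cv_pct"])
    mm_score = max(0.0, 100.0 - 50.0 * (metrics["max_min_ratio"] - 1.0))
    tail_score = max(0.0, 100.0 - 100.0 * (metrics["p99_p50_ratio"] - 1.0))
    w_cv, w_mm, w_tail = weights
    return w_cv * cv_score + w_mm * mm_score + w_tail * tail_score
```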
Intra-primary (private) Dataset Stability Comparison
I will begin the analysis by examining the key metrics for the primary (private) dataset. This section focuses on assessing the inherent stability of our benchmarking environment before making any comparison to public infrastructure. By analyzing the key statistical metrics above across different model and device combinations, we can establish a baseline understanding of performance consistency and stability.
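A minimal sketch of how the per-combination summary below could be assembled, reusing `stability_metrics()` and `composite_score()` from the earlier sketch. The column names ("model", "device", "latency_ms") are assumptions, not the actual spreadsheet/dashboard schema.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Per-(model, device) stability summary from raw latency rows."""
    rows = []
    for (model, device), group in df.groupby(["model", "device"]):
        m = stability_metrics(group["latency_ms"])
        m["stability_score"] = composite_score(m)
        rows.append({"model": model, "device": device, **m})
    return pd.DataFrame(rows).sort_values("stability_score", ascending=False)
```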
Overall Stability Summary:
Device-based Comparison:
My insights and recommendations
The analysis of latency stability across private AWS devices reveals certain patterns in performance consistency:
Intra-private analysis reveals that S22 with Android 13 and rooted Pixel 3 devices provide acceptable stability across all tested delegates (QNN and XNNPACK) and models (Llama3.2-1b and MobileNetV3), demonstrating that our private AWS infrastructure can deliver consistent benchmarking results.
Inter-dataset (private & public) Stability Comparison
To assess whether private AWS devices provide better stability than their public counterparts, I conducted a detailed comparison between matching datasets from both environments. This section presents an apples-to-apples comparison of benchmark stability for identical model-device combinations, allowing us to directly evaluate the benefits of moving to private infrastructure.
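As a sketch of how each pairwise comparison below could be produced (again reusing `stability_metrics()` from earlier; the inputs are assumed to be the raw latency samples for the same model+config+device from each fleet):

```python
def compare_fleets(private_lat, public_lat):
    """Side-by-side metrics for the same model+config+device on both fleets,
    plus the private-vs-public delta in percent."""
    priv, pub = stability_metrics(private_lat), stability_metrics(public_lat)
    return {
        name: {
            "private": priv[name],
            "public": pub[name],
            "delta_pct": (priv[name] - pub[name]) / pub[name] * 100.0,
        }
        for name in priv
    }
```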
1. llama3_qlora+s22_android13 (Private) vs. llama3_qlora+s22_android13 (Public)
Metrics Comparison:
Interpretation:
2. llama3_spinq+s22_android13 (Private) vs llama3_spinq+s22_android13 (Public)
Metrics Comparison:
Interpretation:
3. mv3_xnnq8+s22_android13 (Private) vs. mv3_xnnq8+s22_android13 (Public)
Metrics Comparison:
Interpretation:
Primary (Private) Datasets Summary:
Reference (Public) Datasets Summary:
My insights and recommendations
Detailed Stability Analysis on Individual Dataset - Primary (Private)
The full set of individual dataset analyses can be downloaded here. In this section I highlight detailed statistical metrics for only a few selected datasets.
1. Latency Stability Analysis: llama3_spinq+s22_android13 (Primary)
2. Latency Stability Analysis: mv3_qnn+s22_android13 (Primary)
3. Latency Stability Analysis: mv3_xnnq8+s22_android13 (Primary)
4. Latency Stability Analysis: llama3_spinq+iphone15max_ios17 (Primary)
Summary of Conclusions and Next Steps
Android Benchmarking
The analysis shows that private AWS devices provide significantly better stability for Android benchmarking. The data supports a specific configuration strategy:
As next steps, we should:
iOS Benchmarking
Both private and public iOS devices show poor stability across all models (CV values of 21-37%), indicating fundamental flaws in our iOS benchmarking methodology/app. Until methodology improvements have been validated with new benchmark data, we risk not being able to reach a meaningful conclusion on moving iOS benchmarking forward by the end of June.
DevX Improvements
Our current benchmarking infrastructure has critical gaps that limit our ability to understand and address stability issues. These limitations are particularly problematic when trying to diagnose the root causes of performance variations we've observed across devices.
Current Gaps
Addressing these gaps is urgent for establishing a reliable benchmarking infrastructure. Without these improvements, we risk being unable to make timely decisions and may base conclusions on misleading or incomplete data.
References
Here I have attached the data sources and my script in case anyone wants to repeat the work. Please also use them as a reference when filling the infra gaps above.
The script used for analysis
Data source:
Datasets from Primary/Private AWS devices:
Benchmark Dataset with Private AWS Devices.xlsx
Datasets from Reference/Public AWS devices:
Benchmark Dataset with Public AWS Devices.xlsx
Each tab represents one dataset collected for one model+config+device combination. The data is copied from the ExecuTorch benchmark dashboard.