Title: METHODOLOGY FOR OPTIMIZATION OF NEURAL NETWORK ARCHITECTURES ON MOBILE DEVICES
Author: Pozdniakova Mariia Olehivna
Abstract: Modern phones carry more compute than the workstations that trained AlexNet, yet every extra millisecond on-device still costs battery and patience. To help engineers squeeze the last drop of efficiency from that silicon, this article distils findings from twelve rigorously vetted investigations released since 2021 that benchmark compression, pruning, quantisation, and hardware-aware search across ARM CPUs, mobile GPUs, and low-power DSPs. First, a layered conceptual frame links three optimisation levers (architecture, numerical precision, and sparse execution) to the twin constraints of energy per inference and perceptible latency. Then, using a unified effect-size index (joules × milliseconds-weighted error), we re-analyse the reported results, correcting for dataset overlap and tooling variance. The synthesis is clear: eight-bit post-training quantisation remains the highest-yield, lowest-risk tactic, trimming power budgets by roughly one-third while keeping top-1 accuracy within a single percentage point; structured pruning, especially when aligned with dedicated sparse kernels, halves CPU-bound latency but yields smaller gains on memory-hungry GPUs; and hardware-guided neural architecture search shines when workloads mix vision and audio but offers diminishing returns once model size drops below two million parameters. Stacking these stages in the order quantise-then-prune-then-search delivers the steepest descent on our composite loss surface, hitting sub-3 mJ and sub-30 ms targets in over 80% of the reviewed trials. Finally, cross-framework comparison reveals that lightweight runtimes tuned for shader execution edge out more general interpreters on Adreno-class GPUs, whereas conventional kernels remain king on mid-range CPUs. By translating scattered empirical numbers into a practical decision map, the paper offers a ready-to-apply checklist for developers who must ship privacy-respecting, latency-aware AI without burning months on fresh experiments. It also highlights blind spots (multi-modal transformers, diffusion backbones, dynamic voltage scaling) that demand the next wave of evidence rather than anecdote.
Keywords: Mobile inference, neural architecture optimisation, quantisation, pruning, energy-latency trade-off, meta-analysis, ARM processors.
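Because the abstract singles out eight-bit post-training quantisation as the highest-yield, lowest-risk step in the quantise-then-prune-then-search pipeline, a minimal sketch of that first stage may be useful. It assumes TensorFlow Lite as the deployment runtime (the abstract names no specific framework), and the model path and random calibration generator below are illustrative placeholders, not details from the paper.

```python
# Minimal sketch: full-integer 8-bit post-training quantisation with
# TensorFlow Lite. Assumes a Keras model already exported as a
# SavedModel; "saved_model_dir" and the random calibration data are
# hypothetical stand-ins for a real model and dataset.
import tensorflow as tf

def representative_dataset():
    # Yield a few calibration batches shaped like the model's real
    # inputs so the converter can estimate activation ranges.
    for _ in range(100):
        yield [tf.random.uniform((1, 224, 224, 3), dtype=tf.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict the converter to int8 kernels so arithmetic stays in
# eight-bit on ARM CPUs and low-power DSPs.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```

Benchmarking the resulting model_int8.tflite on-device (e.g. via tf.lite.Interpreter) is where the roughly one-third power saving and sub-percentage-point top-1 drop reported in the abstract would be measured before moving on to the pruning and search stages.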