# qwen35_backend_patch

**Repository Path**: swner_admin/qwen35_backend_patch

## Basic Information

- **Project Name**: qwen35_backend_patch
- **Description**: Qwen3.5-9B multimodal deployment and optimization workspace for Huawei Ascend 910A
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-03-26
- **Last Updated**: 2026-03-30

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# qwen35_backend_patch for Ascend 910A

An Ascend `910A`-oriented Qwen3.5-9B multimodal deployment and optimization workspace. This repository is developed explicitly for the Huawei Ascend `910A`, not as a generic multi-platform backend.

## Scope

- Target hardware: Huawei Ascend `910A`
- Target model family: `Qwen3.5-9B` multimodal
- Main goal: make `910A` deployment genuinely usable under real text, multimodal, and mixed traffic
- Current most stable single-service direction: `transformers serve` + `recurrent v2`
- Current fastest mixed-traffic direction: `adaptive router` + `18083 (mm)` + `18084 (recurrent v2)` + `18087 (v3)`

This repository contains:

- source-level experiments around `vllm` / `vllm-ascend`
- `transformers serve`-based multimodal deployment scripts
- adaptive three-lane routing scripts for mixed traffic
- a public web console and an OpenAI-compatible proxy for leadership/demo testing
- isolated candidate rollout scripts for `chunk v2`, `recurrent v2`, `v3`, and `v4`
- benchmark and comparison utilities
- experiment records and deployment notes

## Current recommendation

For single-service rollout on one `910A`, the most stable current candidate is:

- `recurrent v2`

For mixed-traffic rollout behind one unified entrypoint, the fastest current option is:

- `adaptive router` on top of:
  - multimodal target `18083`
  - stable text target `18084` running `recurrent v2`
  - throughput text target `18087` running `v3`
  - text policy `balanced_long_text`

For text-heavy high-concurrency single-service scenarios, keep evaluating:

- `v3`

See:

- [QWEN35_MM_910A_STABLE_DEPLOYMENT_PLAN_2026-03-29.md](QWEN35_MM_910A_STABLE_DEPLOYMENT_PLAN_2026-03-29.md)
- [QWEN35_MM_910A_DEPLOYMENT_2026-03-29.md](QWEN35_MM_910A_DEPLOYMENT_2026-03-29.md)
- [QWEN35_MM_910A_LEADERSHIP_TEST_NOTE_2026-03-30.md](QWEN35_MM_910A_LEADERSHIP_TEST_NOTE_2026-03-30.md)

## Leadership demo entrypoints

Public demo entrypoints currently exposed on the Ascend `910A` host (host name elided):

- web console: `http://:9105/`
- recommended multimodal/mixed API: `http://:9101/v1/chat/completions`
- dedicated multimodal API: `http://:9102/v1/chat/completions`
- stable multimodal API: `http://:9103/v1/chat/completions`
- fastest text API: `http://:9111/v1/chat/completions`

The `9105` console provides:

- new-session / multi-turn chat
- multi-image upload in a single turn
- a streaming toggle
- per-response `tok/s`, output-token count, and KV-cache decode-token count
- automatic long-history context compression

The current long-history compression path is:

- sentence importance scored by `TextRank / PageRank`
- per-segment retention ratios allocated by simulated annealing under a token budget

This keeps recent turns verbatim while compressing older history before requests are sent upstream.

## 910A performance snapshot

Measured on `2026-03-29` and `2026-03-30` on a Huawei Ascend `910A`. The numbers below are end-to-end latencies in seconds, not kernel-only microbenchmarks.
### Single request latency

| Case | baseline `18084` | `recurrent v2` | `v3` |
| --- | ---: | ---: | ---: |
| `text_8` | `0.946` | `0.908` | `0.852 ~ 0.881` |
| `text_32` | `3.644` | `3.196` | `3.172 ~ 3.376` |
| `text_64` | `6.670` | `6.160` | `5.889 ~ 6.546` |
| `mm_8` | `0.695` | `0.654` | `0.663 ~ 0.698` |
| `stream_text_8 total` | `0.995` | `0.898` | `0.849 ~ 0.918` |

### Text concurrency

| Case | baseline `18084` | `recurrent v2` | `v3` |
| --- | ---: | ---: | ---: |
| `concurrency=4 wall` | `4.070` | `3.887` | `3.106` |
| `concurrency=8 wall` | `9.542` | `6.650` | `6.189` |

Approximate request throughput from the `concurrency=8` run:

- baseline `18084`: `8 / 9.542 = 0.84 req/s`
- `recurrent v2`: `8 / 6.650 = 1.20 req/s`
- `v3`: `8 / 6.189 = 1.29 req/s`

### Multimodal and mixed traffic

| Case | baseline `18084` | `recurrent v2` | `v3` |
| --- | ---: | ---: | ---: |
| `mm concurrency=2 wall` | `1.163` | `1.156` | `1.064` |
| `mm concurrency=4 wall` | `2.462` | `2.449` | `2.842` |
| `mixed(2 text + 2 mm) wall` | `3.581` | `2.684` | `3.090` |

### Current live three-lane routing

This mode keeps:

- multimodal requests on the dedicated `18083`
- the first short text request on `18087`, with the next short text allowed to spill to an idle `18084`
- long text split across `18087` and `18084`
- a unified entrypoint on router port `18088`

Latest benchmark on `2026-03-30`, after promoting `18084` to `recurrent v2` in place and enabling short-text spill:

| Case | adaptive router `18088` |
| --- | ---: |
| `mixed(2 text8 + 2 mm8) wall` | `1.173` |
| `mixed(2 text32 + 2 mm8) wall` | `3.177` |
| `mixed(2 text64 + 2 mm8) wall` | `6.248` |

Further short-text mixed validation on the same topology:

- `mixed(3 text8 + 2 mm8) wall = 1.792s`
- `mixed(4 text8 + 2 mm8) wall = 2.229s`
- `mixed(4 text32 + 2 mm8) wall = 6.403s`
- the stale remote router previously returned `502` on `4 text8 + 2 mm8`

### Practical recommendation

- If you only have one NPU for this model service, use `recurrent v2`.
- If you have isolated lanes for `18083/18084/18087`, use the adaptive router with `ADAPTIVE_ROUTER_TEXT_POLICY=balanced_long_text`.
- Keep `18084` on `recurrent v2`, keep `18087` on `v3`, and dedicate `18083` to multimodal traffic.
- If you only optimize for text-heavy high concurrency, keep evaluating `v3`.
- `chunk v2` is useful as part of `v3`, but not as the main standalone rollout candidate.
- If traffic is pure multimodal and you want the lowest latency, call `18083` directly instead of going through the adaptive proxy.

## One-click startup

Single-service stable mode:

```bash
MODE=stable STABLE_NPU=4 STABLE_PORT=18084 bash remote_one_click_qwen35_mm_910a.sh
```

Text-heavy throughput mode:

```bash
MODE=throughput THROUGHPUT_NPU=3 bash remote_one_click_qwen35_mm_910a.sh
```

Adaptive three-lane mode:

```bash
MODE=adaptive \
STABLE_NPU=4 \
STABLE_PORT=18084 \
THROUGHPUT_NPU=3 \
THROUGHPUT_PORT=18087 \
MULTIMODAL_NPU=7 \
MULTIMODAL_PORT=18083 \
ADAPTIVE_ROUTER_PORT=18088 \
ADAPTIVE_STABLE_BASE_URL=http://127.0.0.1:18084 \
ADAPTIVE_MULTIMODAL_BASE_URL=http://127.0.0.1:18083 \
ADAPTIVE_ROUTER_TEXT_POLICY=balanced_long_text \
ADAPTIVE_ROUTER_LONG_TEXT_MIN_TOKENS=32 \
ADAPTIVE_ROUTER_SCRIPT_ROOT=/root/qwen35_backend_patch_runtime \
bash remote_one_click_qwen35_mm_910a.sh
```

Adaptive route check:

```bash
ROUTER_PORT=18088 bash remote_check_qwen35_mm_adaptive_router.sh
```

## Repository layout

- `patch_*.py`: source patch helpers
- `test_*.py`: correctness checks for patch logic
- `remote_*.sh`: isolated remote deployment, restart, benchmark, and compare scripts
- `benchmark_*.py`: benchmark utilities
- `vllm/`, `vllm_ascend/`: source adaptation workspace
- `*.md`: experiment records, deployment plans, and runbooks

## Usage notes

- The scripts are written for isolated rollout on a dedicated Ascend host.
- Host-specific directories are configurable through environment variables.
- Public docs use placeholders such as `` and `` where host identity should not be disclosed.

## License

This repository is released under the Apache License 2.0.
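As an illustration of the three-lane dispatch described under "Current live three-lane routing", the following is a minimal Python sketch. The message fields follow the OpenAI chat format; the character-based token estimate and the idle-spill heuristic are assumptions for illustration, not the repository's actual router logic:

```python
# Hypothetical sketch of the adaptive three-lane dispatch.
MM_PORT, STABLE_PORT, THROUGHPUT_PORT = 18083, 18084, 18087
LONG_TEXT_MIN_TOKENS = 32  # mirrors ADAPTIVE_ROUTER_LONG_TEXT_MIN_TOKENS


def _parts(messages):
    # Normalize OpenAI-style content (plain string or list of parts).
    for m in messages:
        c = m.get("content", "")
        if isinstance(c, str):
            yield {"type": "text", "text": c}
        else:
            yield from c


def route(messages, inflight=None):
    """Pick a backend port for one chat request.

    inflight maps port -> number of in-flight requests (router state).
    """
    inflight = inflight or {}
    parts = list(_parts(messages))
    if any(p.get("type") == "image_url" for p in parts):
        return MM_PORT  # multimodal lane is always dedicated
    # Crude length estimate: ~4 characters per token (assumption).
    tokens = sum(len(p.get("text", "")) for p in parts) // 4
    if tokens >= LONG_TEXT_MIN_TOKENS:
        # balanced_long_text: split long text across both text lanes.
        return min((THROUGHPUT_PORT, STABLE_PORT),
                   key=lambda p: inflight.get(p, 0))
    # Short text prefers 18087 and spills to an idle 18084.
    if inflight.get(THROUGHPUT_PORT, 0) == 0:
        return THROUGHPUT_PORT
    return STABLE_PORT if inflight.get(STABLE_PORT, 0) == 0 else THROUGHPUT_PORT
```

Keeping the multimodal lane unconditionally isolated matches the recommendation above: image requests never queue behind long text, which is why calling `18083` directly is fastest for pure multimodal traffic.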