# qwen35_backend_patch

**Repository Path**: swner_admin/qwen35_backend_patch

## Basic Information

- **Project Name**: qwen35_backend_patch
- **Description**: Qwen3.5-9B multimodal deployment and optimization workspace for Huawei Ascend 910A
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2026-03-26
- **Last Updated**: 2026-03-30

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# qwen35_backend_patch for Ascend 910A

An Ascend `910A`-oriented Qwen3.5-9B multimodal deployment and optimization workspace. This repository is developed explicitly for the Huawei Ascend `910A`, not as a generic multi-platform backend.

## Scope

- Target hardware: Huawei Ascend `910A`
- Target model family: `Qwen3.5-9B` multimodal
- Main goal: make `910A` deployment genuinely usable under real text, multimodal, and mixed traffic
- Current most stable single-service direction: `transformers serve` + `recurrent v2`
- Current fastest mixed-traffic direction: `adaptive router` + `18083 (mm)` + `18084 (recurrent v2)` + `18087 (v3)`

This repository contains:

- source-level experiments around `vllm` / `vllm-ascend`
- `transformers serve`-based multimodal deployment scripts
- adaptive three-lane routing scripts for mixed traffic
- a public web console and an OpenAI-compatible proxy for leadership/demo testing
- isolated candidate rollout scripts for `chunk v2`, `recurrent v2`, `v3`, and `v4`
- benchmark and comparison utilities
- experiment records and deployment notes

## Current recommendation

For single-service rollout on one `910A`, the most stable current candidate is:

- `recurrent v2`

For mixed-traffic rollout behind one unified entrypoint, the fastest current option is:

- `adaptive router` on top of:
  - multimodal target `18083`
  - stable text target `18084` running `recurrent v2`
  - throughput text target `18087` running `v3`
  - text policy `balanced_long_text`

For text-heavy high-concurrency single-service scenarios, keep evaluating:

- `v3`

See:

- [QWEN35_MM_910A_STABLE_DEPLOYMENT_PLAN_2026-03-29.md](QWEN35_MM_910A_STABLE_DEPLOYMENT_PLAN_2026-03-29.md)
- [QWEN35_MM_910A_DEPLOYMENT_2026-03-29.md](QWEN35_MM_910A_DEPLOYMENT_2026-03-29.md)
- [QWEN35_MM_910A_LEADERSHIP_TEST_NOTE_2026-03-30.md](QWEN35_MM_910A_LEADERSHIP_TEST_NOTE_2026-03-30.md)

## Leadership demo entrypoints

Public demo entrypoints currently exposed on the Ascend `910A` host (host name elided):

- web console: `http://:9105/`
- recommended multimodal/mixed API: `http://:9101/v1/chat/completions`
- dedicated multimodal API: `http://:9102/v1/chat/completions`
- stable multimodal API: `http://:9103/v1/chat/completions`
- fastest text API: `http://:9111/v1/chat/completions`

The `9105` console provides:

- new-session / multi-turn chat
- multi-image upload in a single turn
- a streaming toggle
- per-response `tok/s`, output-token count, and KV-cache decode-token count
- automatic long-history context compression

The current long-history compression path is:

- sentence importance scored by `TextRank / PageRank`
- per-segment retention ratios allocated by simulated annealing under a token budget

This keeps recent turns verbatim while compressing older history before requests are sent upstream.

## 910A performance snapshot

Measured on `2026-03-29` and `2026-03-30` on a Huawei Ascend `910A`. The numbers below are end-to-end latencies in seconds, not kernel-only microbenchmarks.
### Single request latency

| Case | baseline `18084` | `recurrent v2` | `v3` |
| --- | ---: | ---: | ---: |
| `text_8` | `0.946` | `0.908` | `0.852 ~ 0.881` |
| `text_32` | `3.644` | `3.196` | `3.172 ~ 3.376` |
| `text_64` | `6.670` | `6.160` | `5.889 ~ 6.546` |
| `mm_8` | `0.695` | `0.654` | `0.663 ~ 0.698` |
| `stream_text_8 total` | `0.995` | `0.898` | `0.849 ~ 0.918` |

### Text concurrency

| Case | baseline `18084` | `recurrent v2` | `v3` |
| --- | ---: | ---: | ---: |
| `concurrency=4 wall` | `4.070` | `3.887` | `3.106` |
| `concurrency=8 wall` | `9.542` | `6.650` | `6.189` |

Approximate request throughput from the `concurrency=8` run:

- baseline `18084`: `8 / 9.542 = 0.84 req/s`
- `recurrent v2`: `8 / 6.650 = 1.20 req/s`
- `v3`: `8 / 6.189 = 1.29 req/s`

### Multimodal and mixed traffic

| Case | baseline `18084` | `recurrent v2` | `v3` |
| --- | ---: | ---: | ---: |
| `mm concurrency=2 wall` | `1.163` | `1.156` | `1.064` |
| `mm concurrency=4 wall` | `2.462` | `2.449` | `2.842` |
| `mixed(2 text + 2 mm) wall` | `3.581` | `2.684` | `3.090` |

### Current live three-lane routing

This mode keeps:

- multimodal requests on the dedicated `18083`
- the first short text request on `18087`, with the next short text allowed to spill to an idle `18084`
- long text split across `18087` and `18084`
- a unified entrypoint on router port `18088`

Latest benchmark on `2026-03-30`, after promoting `18084` to `recurrent v2` in place and enabling short-text spill:

| Case | adaptive router `18088` |
| --- | ---: |
| `mixed(2 text8 + 2 mm8) wall` | `1.173` |
| `mixed(2 text32 + 2 mm8) wall` | `3.177` |
| `mixed(2 text64 + 2 mm8) wall` | `6.248` |

Further short-text mixed validation on the same topology:

- `mixed(3 text8 + 2 mm8) wall = 1.792s`
- `mixed(4 text8 + 2 mm8) wall = 2.229s`
- `mixed(4 text32 + 2 mm8) wall = 6.403s`
- the stale remote router previously returned `502` on `4 text8 + 2 mm8`

### Practical recommendation

- If you only have one NPU for this model service, use `recurrent v2`.
- If you have isolated lanes for `18083/18084/18087`, use the adaptive router with `ADAPTIVE_ROUTER_TEXT_POLICY=balanced_long_text`.
- Keep `18084` on `recurrent v2`, keep `18087` on `v3`, and dedicate `18083` to multimodal traffic.
- If you only optimize for text-heavy high concurrency, keep evaluating `v3`.
- `chunk v2` is useful as part of `v3`, but not as the main standalone rollout candidate.
- If traffic is pure multimodal and you want the lowest latency, call `18083` directly instead of going through the adaptive proxy.

## One-click startup

Single-service stable mode:

```bash
MODE=stable STABLE_NPU=4 STABLE_PORT=18084 bash remote_one_click_qwen35_mm_910a.sh
```

Text-heavy throughput mode:

```bash
MODE=throughput THROUGHPUT_NPU=3 bash remote_one_click_qwen35_mm_910a.sh
```

Adaptive three-lane mode:

```bash
MODE=adaptive \
STABLE_NPU=4 \
STABLE_PORT=18084 \
THROUGHPUT_NPU=3 \
THROUGHPUT_PORT=18087 \
MULTIMODAL_NPU=7 \
MULTIMODAL_PORT=18083 \
ADAPTIVE_ROUTER_PORT=18088 \
ADAPTIVE_STABLE_BASE_URL=http://127.0.0.1:18084 \
ADAPTIVE_MULTIMODAL_BASE_URL=http://127.0.0.1:18083 \
ADAPTIVE_ROUTER_TEXT_POLICY=balanced_long_text \
ADAPTIVE_ROUTER_LONG_TEXT_MIN_TOKENS=32 \
ADAPTIVE_ROUTER_SCRIPT_ROOT=/root/qwen35_backend_patch_runtime \
bash remote_one_click_qwen35_mm_910a.sh
```

Adaptive route check:

```bash
ROUTER_PORT=18088 bash remote_check_qwen35_mm_adaptive_router.sh
```

## Repository layout

- `patch_*.py`: source patch helpers
- `test_*.py`: correctness checks for patch logic
- `remote_*.sh`: isolated remote deployment, restart, benchmark, and compare scripts
- `benchmark_*.py`: benchmark utilities
- `vllm/`, `vllm_ascend/`: source adaptation workspace
- `*.md`: experiment records, deployment plans, and runbooks

## Usage notes

- The scripts are written for isolated rollout on a dedicated Ascend host.
- Host-specific directories are configurable through environment variables.
- Public docs use placeholders such as `` and `` where host identity should not be disclosed.

## License

This repository is released under the Apache License 2.0.
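As an illustration of the three-lane dispatch described under "Current live three-lane routing", the following is a minimal Python sketch. The message fields follow the OpenAI chat format; the character-based token estimate and the idle-spill heuristic are assumptions for illustration, not the repository's actual router logic:

```python
# Hypothetical sketch of the adaptive three-lane dispatch.
MM_PORT, STABLE_PORT, THROUGHPUT_PORT = 18083, 18084, 18087
LONG_TEXT_MIN_TOKENS = 32  # mirrors ADAPTIVE_ROUTER_LONG_TEXT_MIN_TOKENS


def _parts(messages):
    # Normalize OpenAI-style content (plain string or list of parts).
    for m in messages:
        c = m.get("content", "")
        if isinstance(c, str):
            yield {"type": "text", "text": c}
        else:
            yield from c


def route(messages, inflight=None):
    """Pick a backend port for one chat request.

    inflight maps port -> number of in-flight requests (router state).
    """
    inflight = inflight or {}
    parts = list(_parts(messages))
    if any(p.get("type") == "image_url" for p in parts):
        return MM_PORT  # multimodal lane is always dedicated
    # Crude length estimate: ~4 characters per token (assumption).
    tokens = sum(len(p.get("text", "")) for p in parts) // 4
    if tokens >= LONG_TEXT_MIN_TOKENS:
        # balanced_long_text: split long text across both text lanes.
        return min((THROUGHPUT_PORT, STABLE_PORT),
                   key=lambda p: inflight.get(p, 0))
    # Short text prefers 18087 and spills to an idle 18084.
    if inflight.get(THROUGHPUT_PORT, 0) == 0:
        return THROUGHPUT_PORT
    return STABLE_PORT if inflight.get(STABLE_PORT, 0) == 0 else THROUGHPUT_PORT
```

Keeping the multimodal lane unconditionally isolated matches the recommendation above: image requests never queue behind long text, which is why calling `18083` directly is fastest for pure multimodal traffic.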