# AtomicRAG **Repository Path**: joo2019/AtomicRAG ## Basic Information - **Project Name**: AtomicRAG - **Description**: No description available - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2026-05-26 - **Last Updated**: 2026-05-26 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # AtomicRAG: Atom-Entity Graphs for Retrieval-Augmented Generation [![arXiv](https://img.shields.io/badge/arXiv-2604.20844-b31b1b.svg)](https://arxiv.org/abs/2604.20844) [![GitHub](https://img.shields.io/badge/GitHub-7HHHHH%2FAtomicRAG-181717.svg)](https://github.com/7HHHHH/AtomicRAG) Official implementation of **AtomicRAG: Atom-Entity Graphs for Retrieval-Augmented Generation**. AtomicRAG is a graph-based retrieval-augmented generation framework that stores knowledge as fine-grained, self-contained **knowledge atoms** rather than coarse text chunks or predicate-only triples. It builds an **Atom-Entity Graph (AEG)**, retrieves evidence through entity-aware graph propagation, and filters retrieved atoms before grounded answer generation. Paper: [arXiv:2604.20844](https://arxiv.org/abs/2604.20844) Code: [https://github.com/7HHHHH/AtomicRAG](https://github.com/7HHHHH/AtomicRAG) ## Overview AtomicRAG addresses two common failure modes in GraphRAG-style systems: - Coarse chunk-level storage entangles multiple facts in one retrieval unit, which adds noise for fine-grained or multi-hop questions. - Predicate-labeled triple graphs can be brittle when relation extraction is incomplete or incorrect. ## Motivation

AtomicRAG motivation: from coarse chunks and brittle triples to atom-entity graphs

## Framework

AtomicRAG framework: atom-entity graph construction, retrieval, filtering, and generation

The pipeline has four stages: 1. **Atom-Entity Graph Construction**: decomposes corpus chunks into knowledge atoms, extracts entities, and constructs an unlabeled weighted graph with containment, relevance, and synonym edges. 2. **Atomic Question Decomposition**: scores query complexity and decomposes only complex questions into up to three atom-aligned sub-questions. 3. **Entity-Resonance Graph Retrieval**: combines dense atom retrieval with entity seeds and Personalized PageRank over the AEG. 4. **Atomic Sieve and Answer Generation**: filters redundant or irrelevant atoms and passes a compact evidence set to the generator. ## Repository Layout ```text atomicrag/ Core AtomicRAG implementation atomicrag.py Indexing, retrieval, graph propagation, QA fragment_filter.py Atomic Sieve implementation query_decomposition.py Complexity scoring and query decomposition Evaluation/ Generation-quality evaluation scripts/ Dataset runners and helper scripts configs/atomicrag/ Runtime configuration templates dataset/ Graph-Bench-style corpus folders used by scripts datasets/atomicrag/ Additional precise-QA data files workspaces/ Generated graph/vector caches results/ Predictions, evaluation files, and run statistics ``` `workspaces/`, `results/`, logs, and `configs/atomicrag/llm.env` are ignored by git because they contain generated artifacts or private credentials. The benchmark data under `dataset/` and `datasets/` is intended to be versioned for the public release. ## Installation Python 3.10 is recommended. ```bash conda create -n atomicrag python=3.10 -y conda activate atomicrag pip install -r requirements/atomicrag.txt pip install -r requirements.txt pip install -e . ``` The default embedding model is `BAAI/bge-large-en-v1.5`. Make sure the model can be downloaded from Hugging Face, or set your local Hugging Face cache/offline environment variables before running. ## Configuration AtomicRAG uses OpenAI-compatible chat APIs for LLM calls. You can configure credentials through environment variables or `configs/atomicrag/llm.env`. ```bash export LLM_BASE_URL=https://api.openai.com/v1 export LLM_API_KEY=your_api_key export OPENAI_API_KEY=$LLM_API_KEY export CUDA_VISIBLE_DEVICES=0 ``` Application-level concurrency defaults live in `configs/atomicrag/config.json`: ```json { "models": { "openai": { "max_concurrency": 200 } }, "evaluation": { "max_concurrency": 200 } } ``` For local setup, copy `configs/atomicrag/llm.env.example` to `configs/atomicrag/llm.env` and fill in your own API settings. Only the example file is versioned. ## Data Format The main runner expects each subset to contain one or more corpus directories: ```text dataset/// chunks.json questions.json ``` `chunks.json` is a list of source text chunks. `questions.json` is a list of QA records with at least: ```json { "id": "example-id", "question": "question text", "answer": "reference answer" } ``` Supported subsets in `scripts/run_atomicrag.py`: - `medical` - `novel` - `hotpotqa` - `musique` - `2wikimultihop` ## Quick Start Run AtomicRAG on the Medical subset: ```bash python scripts/run_atomicrag.py \ --subset medical \ --base_dir ./workspaces/atomicrag \ --results_dir ./results/atomicrag \ --model_name gpt-4o-mini \ --embed_model_path BAAI/bge-large-en-v1.5 \ --llm_base_url "$LLM_BASE_URL" \ --use_cache false \ --concurrency 200 \ --qa_prompt_template precise ``` Use cache after a graph has already been built: ```bash python scripts/run_atomicrag.py \ --subset medical \ --base_dir ./workspaces/atomicrag \ --results_dir ./results/atomicrag \ --model_name gpt-4o-mini \ --embed_model_path BAAI/bge-large-en-v1.5 \ --llm_base_url "$LLM_BASE_URL" \ --use_cache true \ --concurrency 200 \ --qa_prompt_template precise ``` Useful flags: - `--use_cache true`: reuse existing OpenIE, embeddings, and graph cache when available. - `--sample N`: run only a prefix sample for debugging. - `--qa_prompt_template precise`: use concise QA prompts for short-answer benchmarks. - `--disable_fragment_filter`: disable Atomic Sieve for ablation. - `--disable_query_decomposition`: disable Atomic Question Decomposition. - `--disable_ppr`: use dense retrieval without graph propagation. ## Running All Benchmarks The shell helpers in `scripts/` run common dataset groups: ```bash bash scripts/run_medical.sh bash scripts/run_novel.sh bash scripts/run_hotpotqa.sh bash scripts/run_musique.sh bash scripts/run_2wikimultihop.sh bash scripts/run_all_datasets.sh ``` Outputs are written under `results/atomicrag/` by default: ```text results/atomicrag// predictions_.json statistics__.json eval_generation_.json ``` ## Evaluation Evaluate a prediction file with the Graph-Bench-style `answer_correctness` metric: ```bash python -m Evaluation.generation_eval \ --model gpt-4o-mini \ --base_url "$LLM_BASE_URL" \ --embedding_model BAAI/bge-large-en-v1.5 \ --data_file results/atomicrag/Medical/predictions_Medical.json \ --output_file results/atomicrag/Medical/eval_generation_Medical.json \ --concurrency 200 ``` Average the 20 Novel corpus evaluations: ```bash python scripts/average_novel_results.py \ --results_dir results/atomicrag \ --output_file results/atomicrag/eval_generation_novel_averaged.json ``` ## Paper Default Configuration The implementation defaults are aligned with the paper configuration. | Component / parameter | Value | |---|---:| | LLM | `gpt-4o-mini` | | Embedding model | `BAAI/bge-large-en-v1.5` | | `embedding_max_seq_len` | `2048` | | `retrieval_top_k` | `25` | | `qa_top_k` | `25` | | `synonymy_edge_topk` | `2047` | | `synonymy_edge_sim_threshold` | `0.8` | | `entity_node_weight` | `1.0` | | `entity_top_k` | `20` | | `entity_sim_threshold` | `0.3` | | Propagation method | `ppr` | | `damping` | `0.3` | | `passage_node_weight` | `0.1` | | `max_sub_questions` | `3` | | `complexity_threshold` | `6.5` | | Chunk size / overlap | `256 / 32` | ## Reported Main Results The paper reports Answer Accuracy (ACC, %) on Graph-Bench and multi-hop QA benchmarks. | Benchmark | Fact | Reason | Summ. | Creat. | Avg. | |---|---:|---:|---:|---:|---:| | Graph-Bench Medical | 72.6 | 74.8 | 76.8 | 68.3 | 73.1 | | Graph-Bench Novel | 61.0 | 53.4 | 68.5 | 60.0 | 60.7 | | Benchmark | HotpotQA | 2WikiMultiHopQA | MuSiQue | Avg. | |---|---:|---:|---:|---:| | Multi-hop QA | 70.5 | 56.8 | 50.9 | 59.4 | Overall average across reported task columns: **64.9**. ## Citation If you use AtomicRAG in your work, please cite: ```bibtex @misc{hou2026atomicrag, title = {AtomicRAG: Atom-Entity Graphs for Retrieval-Augmented Generation}, author = {Yanning Hou and Duanyang Yuan and Sihang Zhou and Xiaoshu Chen and Ke Liang and Siwei Wang and Xinwang Liu and Jian Huang}, year = {2026}, eprint = {2604.20844}, archivePrefix = {arXiv}, primaryClass = {cs.IR}, url = {https://arxiv.org/abs/2604.20844} } ``` ## Notes - The package name is `atomicrag`. - The current codebase is a research implementation; generated graph caches can be large. - Personalized PageRank uses `python-igraph`; graph operations are guarded by locks because concurrent igraph calls are not thread-safe. - For reproducibility, keep the model, embedding model, prompt template, cache mode, and concurrency settings fixed across runs.