{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "b1a2c3d4",
   "metadata": {},
   "source": [
    "# Qwen2.5-7B Pretraining Verification\n",
    "\n",
    "This notebook verifies the pretraining capability of the **Ascend 910B CANN image** by running Qwen2.5-7B pretraining with MindSpeed-LLM.\n",
    "\n",
    "**Workflow:**\n",
    "1. Check the environment\n",
    "2. Prepare the pretraining dataset\n",
    "3. Clone the MindSpeed-LLM scripts\n",
    "4. Convert HF weights to Megatron weights\n",
    "5. Preprocess the data\n",
    "6. Start pretraining\n",
    "\n",
    "> Training parameters are set for verification mode (few iterations + short sequences). Increase `TRAIN_ITERS` and `SEQ_LENGTH` for production runs."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2d3e4f5",
   "metadata": {},
   "source": [
    "## 0. Parameter Configuration"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d3e4f5a6",
   "metadata": {},
   "outputs": [],
   "source": "import warnings\nwarnings.filterwarnings('ignore', category=DeprecationWarning)\nwarnings.filterwarnings('ignore', category=ImportWarning)\nwarnings.filterwarnings('ignore', category=UserWarning)\n\nfrom pathlib import Path\n\n# ===== Path configuration =====\nHF_MODEL_DIR = Path('/opt/app-root/src/models/Qwen2.5-7B')\nWORK_DIR = Path('/opt/app-root/src/Qwen2.5-7B-work-dir')\nMINDSPEED_LLM_DIR = WORK_DIR / 'MindSpeed-LLM'\nDATA_DIR = WORK_DIR / 'pretrain_dataset'\nOUTPUT_DIR = WORK_DIR / 'output' / 'qwen25_7b_pretrained'\nLOGS_DIR = WORK_DIR / 'logs'\n\n# ===== Optional: real dataset path =====\nALPACA_PARQUET = Path('/opt/app-root/src/datasets/alpaca/train-00000-of-00001-a09b74b3ef9c3b56.parquet')\n\n# ===== Ascend environment scripts =====\nCANN_ENV = '/usr/local/Ascend/cann/set_env.sh'\nATB_ENV = '/usr/local/Ascend/nnal/atb/set_env.sh'\n\n# ===== Parallel configuration (must match weight conversion) =====\nTP = 1   # Tensor parallelism\nPP = 4   # Pipeline parallelism; requires at least TP*PP=4 NPUs\n# Note: With TP=1, each PP stage has about 1.6-2.2B parameters. 
The AdamW optimizer states\n# (exp_avg + exp_avg_sq, fp32) take about 17.6 GiB, and weights plus gradients exceed the 29 GiB memory on 910B.\n# Enable --use-distributed-optimizer (ZeRO-1) during training to shard optimizer states by the DP dimension and reduce memory usage.\n\n# ===== Weight conversion output (path includes parallel config to avoid reusing old weights after TP/PP changes) =====\nMCORE_WEIGHTS_DIR = WORK_DIR / 'model_weights' / f'qwen25_mcore_tp{TP}_pp{PP}'\n\n# ===== Training hyperparameters (verification mode) =====\nSEQ_LENGTH = 512     # Recommended production value: 4096\nTRAIN_ITERS = 50     # Recommended production value: 2000+\nMBS = 1\nLR = 1.25e-6\nMIN_LR = 1.25e-7\n\n# ===== Data preprocessing =====\nPROCESSED_DATA_PREFIX = DATA_DIR / 'alpaca'\nDATA_PATH = str(DATA_DIR / 'alpaca_text_document')  # preprocess_data.py automatically adds the _text_document suffix\n\nprint('Configuration loaded')\nprint(f'  Model: {HF_MODEL_DIR}')\nprint(f'  Dataset: {ALPACA_PARQUET}' if ALPACA_PARQUET.exists() else '  Dataset: not found; sample data will be used')\nprint(f'  TP={TP}, PP={PP}, SEQ={SEQ_LENGTH}, ITERS={TRAIN_ITERS}')"
  },
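  {
   "cell_type": "markdown",
   "id": "11aa22bb",
   "metadata": {},
   "source": [
    "The optimizer-state figure in the memory note above can be sanity-checked with quick arithmetic. This is a rough sketch, not a measurement: the `2.2e9` per-stage parameter count is the upper bound quoted in the comment, and actual memory usage additionally includes weights, gradients, and activations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "22bb33cc",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Rough estimate of AdamW optimizer-state memory per PP stage (illustrative numbers from the note above)\n",
    "stage_params = 2.2e9          # assumed upper-bound parameter count of the largest PP stage\n",
    "bytes_per_param = 2 * 4       # exp_avg + exp_avg_sq, 4 bytes each in fp32\n",
    "optim_gb = stage_params * bytes_per_param / 1e9\n",
    "print(f'Optimizer states per stage: ~{optim_gb:.1f} GB')  # ~17.6, matching the note\n",
    "\n",
    "# With --use-distributed-optimizer (ZeRO-1), these states are sharded across DP ranks\n",
    "dp_example = 2                # hypothetical DP degree for illustration\n",
    "print(f'Sharded across DP={dp_example}: ~{optim_gb / dp_example:.1f} GB per rank')"
   ]
  },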
  {
   "cell_type": "markdown",
   "id": "e4f5a6b7",
   "metadata": {},
   "source": [
    "## Helper Functions"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f5a6b7c8",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import subprocess\n",
    "\n",
    "_SUPPRESS_WARNINGS = 'ignore::DeprecationWarning,ignore::ImportWarning,ignore::UserWarning'\n",
    "\n",
    "def run_cmd(cmd, cwd=None, check=True):\n",
    "    'Run a bash command in the Ascend environment and stream output in real time'\n",
    "    env_prefix = f'source {CANN_ENV} && source {ATB_ENV}'\n",
    "    full_cmd = f'{env_prefix} && {cmd}'\n",
    "    print(f'$ {cmd}\\n')\n",
    "    run_env = os.environ.copy()\n",
    "    run_env['PYTHONWARNINGS'] = _SUPPRESS_WARNINGS\n",
    "    result = subprocess.run(\n",
    "        ['bash', '-lc', full_cmd],\n",
    "        cwd=str(cwd or WORK_DIR),\n",
    "        text=True,\n",
    "        env=run_env,\n",
    "    )\n",
    "    if check and result.returncode != 0:\n",
    "        raise RuntimeError(f'Command failed with return code: {result.returncode}')\n",
    "    return result\n",
    "\n",
    "print('Helper function defined: run_cmd()')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a6b7c8d9",
   "metadata": {},
   "source": [
    "## 1. Environment Check"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b7c8d9e0",
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "\n",
    "with warnings.catch_warnings():\n",
    "    warnings.simplefilter('ignore', DeprecationWarning)\n",
    "    warnings.simplefilter('ignore', ImportWarning)\n",
    "    warnings.simplefilter('ignore', UserWarning)\n",
    "    import torch\n",
    "    import torch_npu\n",
    "\n",
    "print('=' * 60)\n",
    "print('Environment check')\n",
    "print('=' * 60)\n",
    "\n",
    "# PyTorch & NPU\n",
    "print(f'PyTorch:    {torch.__version__}')\n",
    "print(f'torch_npu:  {torch_npu.__version__}')\n",
    "nproc = torch.npu.device_count()\n",
    "print(f'NPU count:  {nproc}')\n",
    "for i in range(nproc):\n",
    "    print(f'  NPU {i}: {torch.npu.get_device_name(i)}')\n",
    "\n",
    "# MindSpeed\n",
    "with warnings.catch_warnings():\n",
    "    warnings.simplefilter('ignore', DeprecationWarning)\n",
    "    warnings.simplefilter('ignore', ImportWarning)\n",
    "    warnings.simplefilter('ignore', UserWarning)\n",
    "    import mindspeed\n",
    "    import mindspeed_llm\n",
    "\n",
    "print('MindSpeed:     installed')\n",
    "print('MindSpeed-LLM: installed')\n",
    "\n",
    "# Model files\n",
    "print(f'\\nModel directory: {HF_MODEL_DIR}')\n",
    "assert HF_MODEL_DIR.exists(), f'Model directory does not exist: {HF_MODEL_DIR}'\n",
    "model_files = sorted(HF_MODEL_DIR.glob('*'))\n",
    "for f in model_files[:5]:\n",
    "    if f.is_file():\n",
    "        print(f'  {f.name} ({f.stat().st_size / 1e9:.2f} GB)')\n",
    "if len(model_files) > 5:\n",
    "    print(f'  ... {len(model_files)} files in total')\n",
    "\n",
    "# Parallel configuration check\n",
    "assert nproc >= TP * PP, f'NPU count({nproc}) < TP*PP({TP*PP}); reduce PP'\n",
    "DP = nproc // (TP * PP)\n",
    "GBS = DP * MBS\n",
    "print(f'\\nParallel configuration: TP={TP}, PP={PP}, DP={DP}, GBS={GBS}')\n",
    "\n",
    "assert torch.npu.is_available(), 'NPU is not available'\n",
    "print('\\nEnvironment check passed!')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c8d9e0f1",
   "metadata": {},
   "source": [
    "## 2. Prepare the Pretraining Dataset\n",
    "\n",
    "Pretraining uses raw text data. MindSpeed-LLM's `preprocess_data.py` supports `.parquet`, `.json`, `.jsonl`, `.txt`, and other formats.\n",
    "\n",
    "This example uses the Alpaca dataset in parquet format, which contains a `text` field. If you use another dataset, make sure it contains a `text` field or uses plain text format."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d9e0f1a2",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import warnings\n",
    "\n",
    "DATA_DIR.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "if ALPACA_PARQUET.exists():\n",
    "    print(f'Dataset is ready: {ALPACA_PARQUET.name}')\n",
    "    with warnings.catch_warnings():\n",
    "        warnings.simplefilter('ignore', DeprecationWarning)\n",
    "        import pandas as pd\n",
    "        df = pd.read_parquet(ALPACA_PARQUET)\n",
    "    print(f'{len(df)} samples, columns: {list(df.columns)}')\n",
    "    print('\\nSample examples:')\n",
    "    for i, row in df.head(3).iterrows():\n",
    "        text = str(row.get('text', ''))[:100]\n",
    "        print(f'  [{i}] {text}...')\n",
    "    DATA_INPUT = str(ALPACA_PARQUET)\n",
    "else:\n",
    "    print('Alpaca dataset not found. Creating sample text data\\n')\n",
    "    sample_texts = [\n",
    "        {'text': 'Natural language processing is an important branch of artificial intelligence that studies how computers understand and generate human language.'},\n",
    "        {'text': 'Deep learning uses multilayer neural networks to learn hierarchical data representations and is widely used in computer vision and natural language processing.'},\n",
    "        {'text': 'Python is a high-level programming language known for its concise, readable syntax and rich ecosystem.'},\n",
    "        {'text': 'Machine learning is a core artificial intelligence technology that enables computers to learn and improve automatically from data.'},\n",
    "        {'text': 'Ascend 910B is an artificial intelligence processor from Huawei designed for deep learning training and inference workloads.'},\n",
    "        {'text': 'Pretraining is the first stage of training large language models and learns statistical patterns from massive text corpora.'},\n",
    "        {'text': 'The Transformer architecture is a foundation of modern natural language processing and uses self-attention for parallel sequence modeling.'},\n",
    "        {'text': 'Distributed training spreads large model training workloads across multiple compute devices.'},\n",
    "        {'text': 'Tensor parallelism and pipeline parallelism are two common parallelism strategies for large model training.'},\n",
    "        {'text': 'Gradient accumulation simulates large-batch training when memory is limited.'},\n",
    "    ]\n",
    "    SAMPLE_FILE = DATA_DIR / 'sample_pretrain.jsonl'\n",
    "    with open(SAMPLE_FILE, 'w', encoding='utf-8') as f:\n",
    "        for item in sample_texts:\n",
    "            f.write(json.dumps(item, ensure_ascii=False) + '\\n')\n",
    "    DATA_INPUT = str(SAMPLE_FILE)\n",
    "    print(f'Sample data created: {SAMPLE_FILE}')\n",
    "    print(f'{len(sample_texts)} samples')\n",
    "\n",
    "print(f'\\nData input path: {DATA_INPUT}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e0f1a2b3",
   "metadata": {},
   "source": [
    "## 3. Clone MindSpeed-LLM\n",
    "\n",
    "The `mindspeed_llm` Python package is installed during image build, but the training scripts (`convert_ckpt.py`, `preprocess_data.py`, `pretrain_gpt.py`, and others) must run from the repository directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f1a2b3c4",
   "metadata": {},
   "outputs": [],
   "source": [
    "if MINDSPEED_LLM_DIR.exists():\n",
    "    print(f'Already exists: {MINDSPEED_LLM_DIR}')\n",
    "else:\n",
    "    print('Cloning MindSpeed-LLM (shallow clone)...')\n",
    "    run_cmd(f'git clone --depth 1 https://gitcode.com/ascend/MindSpeed-LLM.git {MINDSPEED_LLM_DIR}')\n",
    "\n",
    "# Verify required scripts\n",
    "scripts = [\n",
    "    ('Weight conversion', 'convert_ckpt.py'),\n",
    "    ('Data preprocessing', 'preprocess_data.py'),\n",
    "    ('Pretraining', 'pretrain_gpt.py'),\n",
    "]\n",
    "for name, script in scripts:\n",
    "    exists = (MINDSPEED_LLM_DIR / script).exists()\n",
    "    print(f'  [{name}] {script}: {\"OK\" if exists else \"MISSING\"}')\n",
    "\n",
    "assert all((MINDSPEED_LLM_DIR / s).exists() for _, s in scripts), 'Required scripts are missing'\n",
    "print('\\nScript check passed!')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a2b3c4d5",
   "metadata": {},
   "source": [
    "## 4. Convert HF Weights to Megatron Weights\n",
    "\n",
    "Convert HuggingFace-format weights to Megatron-Mcore format, split by TP/PP. The first conversion usually takes 5-10 minutes.\n",
    "\n",
    "Qwen2.5 is based on the LLaMA2 architecture with QKV bias, so the conversion uses `--model-type-hf llama2` plus `--add-qkv-bias`."
   ]
  },
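  {
   "cell_type": "markdown",
   "id": "33cc44dd",
   "metadata": {},
   "source": [
    "A quick sketch of how the TP/PP split partitions the model. This is simple arithmetic, not a framework call; the layer count comes from the `--num-layers 28` argument used later in the training cell, and the values are redefined locally so the cell also runs on its own."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "44dd55ee",
   "metadata": {},
   "outputs": [],
   "source": [
    "# How PP partitions Qwen2.5-7B's transformer stack (illustrative arithmetic only)\n",
    "num_layers = 28    # Qwen2.5-7B transformer layers\n",
    "pp_size = 4        # matches PP in the configuration cell\n",
    "assert num_layers % pp_size == 0, 'layer count must divide evenly across PP stages'\n",
    "layers_per_stage = num_layers // pp_size\n",
    "print(f'{layers_per_stage} transformer layers per PP stage')  # 7\n",
    "# TP=1 means each stage holds its layers unsplit; raising TP would further\n",
    "# shard each layer's weight matrices across that many NPUs."
   ]
  },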
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b3c4d5e6",
   "metadata": {},
   "outputs": [],
   "source": [
    "MCORE_WEIGHTS_DIR.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "# Check whether conversion already exists\n",
    "converted = any(MCORE_WEIGHTS_DIR.glob('iter_*'))\n",
    "\n",
    "if converted:\n",
    "    print(f'Weights already exist; skipping conversion: {MCORE_WEIGHTS_DIR}')\n",
    "    for p in sorted(MCORE_WEIGHTS_DIR.iterdir()):\n",
    "        print(f'  {p.name}')\n",
    "else:\n",
    "    convert_cmd = ' && '.join([\n",
    "        f'cd {MINDSPEED_LLM_DIR}',\n",
    "        'python convert_ckpt.py'\n",
    "        ' --use-mcore-models'\n",
    "        ' --model-type GPT'\n",
    "        ' --load-model-type hf'\n",
    "        ' --save-model-type mg'\n",
    "        f' --target-tensor-parallel-size {TP}'\n",
    "        f' --target-pipeline-parallel-size {PP}'\n",
    "        ' --add-qkv-bias'\n",
    "        f' --load-dir {HF_MODEL_DIR}'\n",
    "        f' --save-dir {MCORE_WEIGHTS_DIR}'\n",
    "        f' --tokenizer-model {HF_MODEL_DIR / \"tokenizer.json\"}'\n",
    "        ' --model-type-hf llama2'\n",
    "        ' --params-dtype bf16',\n",
    "    ])\n",
    "    print('Running weight conversion (about 5-10 minutes)...')\n",
    "    run_cmd(convert_cmd, cwd=MINDSPEED_LLM_DIR)\n",
    "    print('Weight conversion completed!')\n",
    "    for p in sorted(MCORE_WEIGHTS_DIR.iterdir()):\n",
    "        print(f'  {p.name}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c4d5e6f7",
   "metadata": {},
   "source": [
    "## 5. Data Preprocessing\n",
    "\n",
    "Convert text data into the binary format (`.bin` + `.idx`) required by MindSpeed-LLM pretraining.\n",
    "\n",
    "No handler needs to be specified for pretraining data processing. `preprocess_data.py` automatically extracts the `text` field and generates `alpaca_text_document.bin` and `alpaca_text_document.idx`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d5e6f7a8",
   "metadata": {},
   "outputs": [],
   "source": [
    "preprocess_cmd = ' && '.join([\n",
    "    f'cd {MINDSPEED_LLM_DIR}',\n",
    "    'python preprocess_data.py'\n",
    "    f' --input {DATA_INPUT}'\n",
    "    f' --tokenizer-name-or-path {HF_MODEL_DIR}'\n",
    "    f' --output-prefix {PROCESSED_DATA_PREFIX}'\n",
    "    ' --tokenizer-type PretrainedFromHF'\n",
    "    ' --workers 4'\n",
    "    ' --log-interval 1000',\n",
    "])\n",
    "\n",
    "print('Running data preprocessing...')\n",
    "run_cmd(preprocess_cmd, cwd=MINDSPEED_LLM_DIR)\n",
    "\n",
    "# Verify outputs\n",
    "print('\\nPreprocessing outputs:')\n",
    "for f in sorted(DATA_DIR.glob('alpaca*')):\n",
    "    print(f'  {f.name} ({f.stat().st_size / 1024:.1f} KB)')\n",
    "\n",
    "assert (DATA_DIR / 'alpaca_text_document.bin').exists() or (DATA_DIR / 'alpaca_text_document.idx').exists(), \\\n",
    "    f'Preprocessing outputs not found: {DATA_DIR / \"alpaca_text_document.*\"}'\n",
    "print('Data preprocessing completed!')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e6f7a8b9",
   "metadata": {},
   "source": [
    "## 6. Start Pretraining\n",
    "\n",
    "Run Qwen2.5-7B pretraining with MindSpeed-LLM. Training logs are streamed to the notebook in real time.\n",
    "\n",
    "> In verification mode, `TRAIN_ITERS=50`. For full pretraining, 2000+ iterations are recommended.\n",
    "\n",
    "**Qwen2.5-7B architecture parameters:**\n",
    "- 28 Transformer layers, hidden size 3584, FFN size 18944\n",
    "- 28 attention heads, 4 KV heads (GQA)\n",
    "- RoPE positional encoding, SwiGLU activation, RMSNorm, QKV bias"
   ]
  },
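  {
   "cell_type": "markdown",
   "id": "55ee66ff",
   "metadata": {},
   "source": [
    "The architecture parameters above imply a few derived sizes. This sketch is plain arithmetic on the listed numbers, included to show what the GQA configuration means in practice:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "66ff77aa",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Derived sizes from the Qwen2.5-7B architecture parameters listed above\n",
    "hidden_size = 3584\n",
    "num_heads = 28\n",
    "num_kv_heads = 4   # GQA: KV heads are shared across groups of query heads\n",
    "\n",
    "head_dim = hidden_size // num_heads\n",
    "queries_per_kv_group = num_heads // num_kv_heads\n",
    "kv_proj_dim = num_kv_heads * head_dim\n",
    "\n",
    "print(f'head_dim = {head_dim}')                              # 128\n",
    "print(f'query heads per KV group = {queries_per_kv_group}')  # 7\n",
    "print(f'K/V projection width = {kv_proj_dim} (vs {hidden_size} for Q)')"
   ]
  },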
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f7a8b9c0",
   "metadata": {},
   "outputs": [],
   "source": "import torch\n\nnproc = torch.npu.device_count()\nDP = nproc // (TP * PP)\nGBS = DP * MBS\n\nLOGS_DIR.mkdir(parents=True, exist_ok=True)\nOUTPUT_DIR.mkdir(parents=True, exist_ok=True)\n\n# Environment variables\nenv = ' && '.join([\n    f'cd {MINDSPEED_LLM_DIR}',\n    'export CUDA_DEVICE_MAX_CONNECTIONS=1',\n    'export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True',\n])\n\n# torchrun distributed arguments\ndistributed = ' '.join([\n    'torchrun',\n    f'--nproc_per_node {nproc}',\n    '--nnodes 1 --node_rank 0',\n    '--master_addr localhost --master_port 6000',\n])\n\n# Model architecture (Qwen2.5-7B)\nmodel_args = ' '.join([\n    '--use-mcore-models',\n    f'--tensor-model-parallel-size {TP}',\n    f'--pipeline-model-parallel-size {PP}',\n    '--sequence-parallel --use-flash-attn',\n    '--transformer-impl local',\n    '--use-distributed-optimizer',\n    '--num-layers 28 --hidden-size 3584 --num-attention-heads 28',\n    '--ffn-hidden-size 18944 --max-position-embeddings 131072',\n    f'--seq-length {SEQ_LENGTH}',\n    '--make-vocab-size-divisible-by 1 --padded-vocab-size 152064',\n    '--position-embedding-type rope --rotary-base 1000000 --use-rotary-position-embeddings',\n    '--group-query-attention --num-query-groups 4',\n    '--add-qkv-bias --disable-bias-linear',\n    '--untie-embeddings-and-output-weights',\n    '--swiglu --normalization RMSNorm --norm-epsilon 1e-6',\n])\n\n# Training hyperparameters\ntrain_args = ' '.join([\n    f'--micro-batch-size {MBS} --global-batch-size {GBS}',\n    f'--train-iters {TRAIN_ITERS}',\n    '--lr-decay-style cosine --lr-warmup-fraction 0.01',\n    '--init-method-std 0.01',\n    f'--lr {LR} --min-lr {MIN_LR}',\n    '--weight-decay 1e-1 --clip-grad 1.0',\n    '--adam-beta1 0.9 --adam-beta2 0.95 --initial-loss-scale 4096',\n    '--no-gradient-accumulation-fusion --attention-softmax-in-fp32',\n    '--no-masked-softmax-fusion --no-load-optim --no-load-rng',\n    '--seed 42 --bf16',\n])\n\n# Activation 
recomputation: recompute forward activations during backward pass to trade compute for memory.\n# With PP=4, each stage has 7 layers, so recomputing all layers maximizes memory savings.\nrecompute_args = ' '.join([\n    '--recompute-granularity full',\n    '--recompute-method block',\n    '--recompute-num-layers 7',\n])\n\n# Data and outputs\ndata_args = ' '.join([\n    f'--data-path {DATA_PATH}',\n    '--split 100,0,0',\n    '--tokenizer-type PretrainedFromHF',\n    f'--tokenizer-name-or-path {HF_MODEL_DIR}',\n    '--log-interval 1',\n    f'--save-interval {TRAIN_ITERS}',\n    f'--eval-interval {TRAIN_ITERS} --eval-iters 0',\n])\n\n# Load and save\noutput_args = ' '.join([\n    f'--load {MCORE_WEIGHTS_DIR} --save {OUTPUT_DIR}',\n    '--distributed-backend nccl',\n    '--exit-on-missing-checkpoint',\n    '--no-save-optim --no-save-rng',\n])\n\ncmd = f'{env} && {distributed} pretrain_gpt.py {model_args} {train_args} {recompute_args} {data_args} {output_args}'\n\nprint(f'Training configuration: {nproc} NPU, TP={TP}, PP={PP}, DP={DP}')\nprint(f'GBS={GBS}, MBS={MBS}, SEQ={SEQ_LENGTH}, ITERS={TRAIN_ITERS}')\nprint(f'Activation recomputation: full (7 layers per PP stage)')\nprint(f'\\nStarting pretraining...\\n')\nrun_cmd(cmd, cwd=MINDSPEED_LLM_DIR)\nprint(f'\\nPretraining completed! Weights saved to: {OUTPUT_DIR}')"
  },
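  {
   "cell_type": "markdown",
   "id": "77aa88bb",
   "metadata": {},
   "source": [
    "In verification mode `GBS = DP * MBS`, so each iteration takes a single forward/backward pass per rank. Megatron-style trainers derive the gradient-accumulation step count from these three values, which the following sketch reproduces. The `gbs=64` value is a hypothetical example, not a recommended setting."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "88bb99cc",
   "metadata": {},
   "outputs": [],
   "source": [
    "# How gradient accumulation falls out of the batch-size settings (illustrative numbers)\n",
    "def accumulation_steps(gbs, dp, mbs):\n",
    "    # GBS must be divisible by DP * MBS for the schedule to be well-defined\n",
    "    assert gbs % (dp * mbs) == 0, 'GBS must be a multiple of DP * MBS'\n",
    "    return gbs // (dp * mbs)\n",
    "\n",
    "print(accumulation_steps(gbs=2, dp=2, mbs=1))   # verification mode: 1 (no accumulation)\n",
    "print(accumulation_steps(gbs=64, dp=2, mbs=1))  # hypothetical larger GBS: 32 micro-steps"
   ]
  },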
  {
   "cell_type": "markdown",
   "id": "a8b9c0d1",
   "metadata": {},
   "source": [
    "## Use a Real Dataset\n",
    "\n",
    "After verification succeeds, use a real dataset for full pretraining as follows:\n",
    "\n",
    "1. **Prepare the data**: place the text dataset inside the container\n",
    "   - Supported formats: `.parquet`, `.json`, `.jsonl`, `.txt`\n",
    "   - The data must contain a `text` field (parquet/json/jsonl) or one text segment per line (txt)\n",
    "\n",
    "2. **Adjust parameters**:\n",
    "   - `SEQ_LENGTH = 4096` to match the model context length\n",
    "   - `TRAIN_ITERS = 2000+` depending on dataset size\n",
     "   - `GBS` based on the NPU count and dataset size; setting it to a multiple of `DP * MBS` larger than `DP * MBS` enables gradient accumulation\n",
    "\n",
    "3. **Save interval**: modify `--save-interval` in the training cell for periodic checkpoints\n",
    "\n",
    "4. **Weight conversion**: if TP/PP changes, rerun weight conversion"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}