Du-IN: Discrete units-guided mask modeling for decoding speech from Intracranial Neural signals (2024)

Hui Zheng*,2,4, Hai-Teng Wang*,1, Wei-Bang Jiang3, Zhong-Tao Chen1,
Li He1, Pei-Yang Lin1, Peng-Hu Wei5, Guo-Guang Zhao5, Yun-Zhe Liu†,1,4
1Beijing Normal University, 2Peking University, 3Shanghai Jiao Tong University,
4Chinese Institute for Brain Research, 5Capital Medical University, Xuanwu Hospital, Beijing
*Equal contribution, †yunzhe.liu@bnu.edu.cn

Abstract

Invasive brain-computer interfaces have garnered significant attention due to their high performance. Current intracranial stereoElectroEncephaloGraphy (sEEG) foundation models typically build univariate representations based on a single channel, and some further use Transformers to model the relationships among channels. However, due to the locality and specificity of brain computation, their performance on more difficult tasks, e.g., speech decoding, which demands intricate processing in specific brain regions, has yet to be fully investigated. We hypothesize that building multi-variate representations within certain brain regions can better capture this specific neural processing. To explore this hypothesis, we collect a well-annotated Chinese word-reading sEEG dataset targeting language-related brain networks across 12 subjects. Leveraging this benchmark dataset, we develop the Du-IN model (Du-IN refers to the phonetic transcription of "讀音", i.e., pronunciation, in Chinese), which extracts contextual embeddings from specific brain regions through discrete codebook-guided mask modeling. Our model achieves SOTA performance on the downstream 61-word classification task, surpassing all baseline models. Model comparison and ablation analysis reveal that our design choices, including (i) multi-variate representation by fusing channels in the vSMC and STG regions and (ii) self-supervision by discrete codebook-guided mask modeling, contribute significantly to this performance. Collectively, our approach, inspired by neuroscience findings and capitalizing on multi-variate neural representations from specific brain regions, is well suited for invasive brain modeling. It marks a promising neuro-inspired AI approach in BCI.

[Figure 1]

1 Introduction

Brain signals refer to the biometric information collected from the brain. Their patterns provide valuable insights toward understanding the physiological functions of the brain and the mechanisms of related diseases, leading to various applications, including speech decoding cho2023neural ; duan2023dewave ; moses2021neuroprosthesis , sleep cognition research liu2022bstt ; zheng2023universal , neurological disorder detection jiang2024large ; zhang2024brant , and so on. Due to their high signal-to-noise ratio, invasive recording methods (e.g., stereoElectroEncephaloGraphy (sEEG), ElectroCorticoGraphy (ECoG)) usually reveal these underlying mechanisms better than non-invasive recording methods. Compared with ECoG, sEEG imposes less trauma on patients and provides more stereotactic information from specific brain regions. Although some researchers moses2021neuroprosthesis ; metzger2023high have recently built high-performance speech decoders based on ECoG, few attempts have been made to build speech decoders based on sEEG.

Modeling intracranial neural signals, especially sEEG, has drawn much research attention, but several issues remain unresolved. Current research on modeling intracranial neural signals is predominantly divided into two lines according to the basic modeling unit (i.e., a single channel or a group of channels). Some studies wang2023brainbert ; zhang2024brant embed a single channel to build univariate representations, focusing primarily on downstream tasks that involve single-channel prediction, e.g., epilepsy detection. However, they have not validated their methods on more challenging tasks that require integrating multiple channels for prediction, e.g., speech decoding. Other studies angrick2021real ; feng2023high fuse a group of channels to build multi-variate representations, but mostly with fully supervised methods that heavily rely on labeled data. Nonetheless, labeling data at scale in medical experiments is often impractical or costly, emphasizing the need to maximize label efficiency. To overcome these limitations, self-supervised pre-training followed by fine-tuning can leverage abundant unlabeled data, enhancing performance on various downstream tasks.

The primary challenge in modeling intracranial neural signals lies in extracting meaningful tokens, which requires careful consideration of two key factors. (1) Temporal scale. Since intracranial neural signals have high temporal resolution and a high signal-to-noise ratio, these tokens must capture rapid dynamic changes in brain activity. (2) Spatial scale. Since the functions of different brain regions are relatively specific, these tokens should correctly capture the information of each brain region for further integration. To better assess how well different models capture the intricate processing within each brain region, we can evaluate these methods on tasks that mainly involve a few brain regions.

Since speech mainly involves specific brain regions related to vocal production, as demonstrated in Section 2.1, we utilize speech decoding tasks to evaluate which model can effectively extract information from specific brain regions. Due to the lack of an open-source sEEG language dataset, we collected a well-annotated Chinese word-reading sEEG dataset (vocal production) including 12 subjects, addressing the absence of sEEG recordings for language tasks. Inspired by neuroscientific findings, we systematically demonstrate the locality and specificity of brain computation and propose the Du-IN model to solve the aforementioned issues. Compared to other existing methods for modeling brain signals, Du-IN achieves SOTA performance on the 61-word classification task, demonstrating the effectiveness of our model in extracting meaningful tokens that capture both the rapid changes and the precise state of specific brain regions. It marks a promising neuro-inspired AI approach saxe2021if ; richards2019deep in BCI.

To sum up, the main contributions of our work comprise:

  1. A well-annotated Chinese word-reading sEEG dataset, addressing the lack of sEEG language datasets. The dataset will be publicly available.

  2. Demonstration of brain-specific computation: achieving the best decoding performance requires only about one electrode in specific brain regions (i.e., vSMC, STG).

  3. A novel framework for sEEG speech decoding, Du-IN, which learns multi-variate contextual embeddings through discrete codebook-guided mask modeling.

  4. SOTA performance on the sEEG speech decoding task: Du-IN achieves 62.70% top-1 accuracy on the 61-word classification task, surpassing all other baselines.

2 Related Works

2.1 Neural Basis of Language Function

Past neuroscientific research bouchard2013functional ; dichter2018control ; sheng2019cortical has extensively explored the brain regions supporting language functionality. In neuroscience, the investigation into speech-related language functionality is divided into two main streams: one dedicated to semantic processing and the other to vocal production. Previous studies binder1997human ; sheng2019cortical have shown that brain regions associated with semantic processing primarily include the left inferior frontal gyrus (IFG), the left anterior temporal lobe (ATL), and the bilateral middle temporal gyrus (MTG).

Vocal production, which is the focus of our work, is predominantly governed by motor information related to language articulation, primarily involving the ventral sensorimotor cortex (vSMC), the bilateral superior temporal gyrus (STG), and the bilateral dorsal laryngeal motor cortex (dLMC) bouchard2013functional ; dichter2018control ; chartier2018encoding . Our analysis of the collected word-reading sEEG dataset also confirms this point, as illustrated in Figure 4.

2.2 Language Decoding in BCI

The keys to decoding natural language from brain signals are (1) high-quality recordings and (2) well-designed models with good representations. Compared to non-invasive recordings (e.g., EEG), invasive recordings provide detailed information about specific brain regions with a high signal-to-noise ratio. Since speech mainly involves specific brain regions, obtaining detailed recordings of these regions significantly enhances decoding performance. Existing works cho2023neural ; moses2021neuroprosthesis ; feng2023high have shown the great potential of building high-performance decoders based on invasive recordings.

The other key is well-designed models with good representations. Existing work on brain-to-language representations can be classified into two categories: self-supervision, or alignment with representation models pre-trained on other modalities (e.g., text, audio). BrainBERT wang2023brainbert learns general embeddings through self-supervised mask modeling. DeWave duan2023dewave introduces discrete codex encoding and aligns neural representations with text embeddings from BART lewis2019bart , thus enhancing the extraction of semantic processing-related information from EEG recordings. Metzger et al. metzger2023high align neural representations with acoustic embeddings to improve the extraction of vocal production-related information from ECoG recordings.

2.3 Self-supervised Learning in BCI

In recent years, self-supervised pre-training has made significant progress in natural language processing devlin2018bert ; radford2018improving ; brown2020language and computer vision bao2021beit ; he2022masked ; chen2020generative . However, its potential in BCI is far from fully explored. BrainBERT (for sEEG) wang2023brainbert builds univariate representations based on a single channel and utilizes mask modeling to learn general representations. Brant (for sEEG) zhang2024brant and LaBraM (for EEG) jiang2024large also build univariate representations but further consider the spatial correlation among channels. BENDR (for EEG) kostas2021bendr takes the other route, building multi-variate representations by fusing all channels, and uses contrastive learning to learn contextual representations. MMM (for EEG) yi2024learning also builds multi-variate representations but further considers the differences among brain regions (i.e., splitting channels into different groups).

All existing sEEG pre-training methods primarily construct univariate representations, and some further employ Transformer models to capture relationships among channels. However, unlike EEG pre-training methods, their effectiveness relative to multi-variate representations has not been experimentally established. Besides, unlike EEG, there is no standard channel configuration for sEEG recordings, which makes modeling spatial relationships in sEEG more challenging.

3 Method

The overall architecture of Du-IN is illustrated in Figure 2, where the raw sEEG signals are fused across channels to build multi-variate representations and further encoded for downstream tasks.

3.1 Task Definition

Due to the lack of open-source sEEG datasets related to language tasks, we follow the experimental design outlined by Moses et al. moses2021neuroprosthesis to collect a well-annotated Chinese word-reading sEEG dataset (vocal production). During the experiment, each subject speaks aloud 61 pre-determined Chinese words 50 times; see Appendix A for more details. We formulate the multi-channel sEEG signals as $\mathcal{X} \in \mathbb{R}^{C \times T}$, where $C$ is the number of sEEG channels and $T$ is the total number of timestamps. The associated word label is denoted as $\bm{y} \in \mathcal{Y}$, where $\mathcal{Y}$ represents the set of 61 pre-determined words. In summary, this dataset comprises paired sEEG-word data $\langle\mathcal{X}, \bm{y}\rangle$, and the model aims to decode the corresponding word $\bm{y}$ from a sequence of raw sEEG signals $\mathcal{X}$.
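To make the data interface concrete, the following is a minimal sketch of a paired sEEG-word dataset wrapper in PyTorch; the array layout and class name are illustrative assumptions, not the released dataset's interface.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class SEEGWordDataset(Dataset):
    """Paired (sEEG segment, word label) samples; layout is illustrative."""

    def __init__(self, seeg: np.ndarray, labels: np.ndarray):
        # seeg: (n_samples, C, T) float array of preprocessed sEEG segments
        # labels: (n_samples,) integer word indices in [0, 61)
        assert seeg.shape[0] == labels.shape[0]
        self.seeg = torch.as_tensor(seeg, dtype=torch.float32)
        self.labels = torch.as_tensor(labels, dtype=torch.long)

    def __len__(self) -> int:
        return self.seeg.shape[0]

    def __getitem__(self, idx):
        # X: (C, T) multi-channel sEEG; y: scalar word index
        return self.seeg[idx], self.labels[idx]
```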

3.2 Model Architecture

[Figure 2]

We introduce the neural Transformer, a general architecture for sEEG speech decoding that can handle input sEEG signals of arbitrary time length, as shown in Figure 2. The key operation for achieving this is segmenting the sEEG signals into patches, inspired by patch embeddings in images dosovitskiy2020image . For each sample $\mathcal{X}$, we use a $W$-length window without overlap to segment it into patches, obtaining $\mathcal{X} = \{\bm{x}_i \in \mathbb{R}^{C \times W} \mid i = 1, \dots, N\}$, where $N = \lfloor T / W \rfloor$ is the number of patches.
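The patching step can be illustrated with a short sketch that splits a $(C, T)$ sample into $N = \lfloor T/W \rfloor$ non-overlapping windows; the function name is ours.

```python
import torch

def segment_into_patches(x: torch.Tensor, window: int) -> torch.Tensor:
    """Split a (C, T) sEEG sample into non-overlapping patches.

    Returns a tensor of shape (N, C, W) with N = T // window;
    any trailing timestamps that do not fill a window are dropped.
    """
    C, T = x.shape
    n_patches = T // window
    x = x[:, : n_patches * window]              # drop the incomplete tail
    patches = x.reshape(C, n_patches, window)   # (C, N, W)
    return patches.permute(1, 0, 2)             # (N, C, W)
```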

Spatial Encoder.

As each sEEG patch has multiple channels, it is vital to fuse different channels to extract meaningful features before patch-wise interaction by self-attention. We employ a spatial encoder, which consists of a linear projection and several convolution blocks, to encode each sEEG patch into a patch embedding. The linear projection transforms the raw sEEG signals into the hidden neural space, and its weights are utilized for subsequent analysis. The convolution block is composed of a 1-D convolution layer and a batch normalization layer ioffe2015batch . We denote the output patch embeddings from the spatial encoder as

\mathcal{E}_p = \{\bm{e}^p_i \in \mathbb{R}^d \mid i = 1, \dots, N\}, \qquad (1)

where $d$ is the dimension of the embeddings.
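A sketch of a spatial encoder with this structure (channel-fusing linear projection followed by 1-D convolution blocks with batch normalization); the kernel sizes, strides, activation, and final pooling step are assumptions not specified above.

```python
import torch
import torch.nn as nn

class SpatialEncoder(nn.Module):
    """Fuse channels within each patch into one d-dim patch embedding.

    Input:  (B, N, C, W) patches; output: (B, N, d) patch embeddings.
    Kernel sizes, strides, activation, and pooling are illustrative assumptions.
    """

    def __init__(self, n_channels: int, d_model: int = 160):
        super().__init__()
        # Channel fusion: a linear projection applied at every timestamp.
        self.proj = nn.Linear(n_channels, d_model)

        # Three 1-D convolution blocks (conv + batch norm) over time.
        def block(stride: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv1d(d_model, d_model, kernel_size=3, stride=stride, padding=1),
                nn.BatchNorm1d(d_model),
                nn.GELU(),
            )
        self.convs = nn.Sequential(block(2), block(2), block(2))
        self.pool = nn.AdaptiveAvgPool1d(1)  # collapse remaining time steps

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        B, N, C, W = patches.shape
        x = patches.reshape(B * N, C, W).transpose(1, 2)  # (B*N, W, C)
        x = self.proj(x).transpose(1, 2)                  # (B*N, d, W)
        x = self.convs(x)                                 # (B*N, d, W')
        x = self.pool(x).squeeze(-1)                      # (B*N, d)
        return x.reshape(B, N, -1)                        # (B, N, d)
```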

Temporal Embedding.

To make the model aware of the temporal information of the patch embeddings, we utilize the parameter-free position embeddings introduced in vaswani2017attention , i.e., $\mathcal{E}_t = \{\bm{e}^t_1, \dots, \bm{e}^t_{t_{max}}\}$. Note that $t_{max}$ is a hyperparameter determining the maximum number of time patches, with $t_{max} \geq N$. Given an arbitrary patch embedding $\bm{e}^p_i$ from the spatial encoder in Equation 1, we add the corresponding temporal embedding to it:

\mathcal{E}_{init} = \{\bm{e}^p_i + \bm{e}^t_i \mid i = 1, \dots, N\}, \qquad (2)

which forms the input embeddings $\mathcal{E}_{init}$ for the Transformer encoder.
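The parameter-free sinusoidal position embeddings of vaswani2017attention can be precomputed once and added to the patch embeddings; a standard implementation is sketched below.

```python
import math
import torch

def sinusoidal_embeddings(t_max: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal position embeddings, shape (t_max, d_model)."""
    position = torch.arange(t_max, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )
    emb = torch.zeros(t_max, d_model)
    emb[:, 0::2] = torch.sin(position * div_term)
    emb[:, 1::2] = torch.cos(position * div_term)
    return emb

# Usage: add the first N embeddings to the patch embeddings (B, N, d), e.g.
#   e_init = e_patch + sinusoidal_embeddings(t_max, d)[:N]
```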

Transformer Encoder.

Finally, the sequence of embeddings is fed into the Transformer encoder vaswani2017attention to obtain the final encoded embeddings $\mathcal{E} = \{\bm{e}_i \in \mathbb{R}^d \mid i = 1, \dots, N\}$. To make the training of the Transformer more stable and efficient, we incorporate some modifications dehghani2023scaling . We add layer normalization to the queries and keys before the dot-product attention mechanism, which avoids over-large values in the attention logits:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{\mathrm{LN}(Q)\,\mathrm{LN}(K)^{T}}{\sqrt{d_{head}}}\right)V, \qquad (3)

where $d_{head}$ is the dimension of each attention head and $\mathrm{LN}$ denotes layer normalization ba2016layer . For downstream classification tasks, we flatten the output embeddings and feed them into a classification head.
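A minimal sketch of the attention variant in Equation 3, with layer normalization applied to queries and keys before the dot product; tensor shapes are illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Scaled dot-product attention with LayerNorm on queries and keys (Eq. 3)."""

    def __init__(self, d_head: int):
        super().__init__()
        self.d_head = d_head
        self.ln_q = nn.LayerNorm(d_head)
        self.ln_k = nn.LayerNorm(d_head)

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # q, k, v: (batch, n_heads, n_patches, d_head)
        q, k = self.ln_q(q), self.ln_k(k)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        return F.softmax(scores, dim=-1) @ v
```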

3.3 Neural Tokenizer Training

Prior to pre-training Du-IN through mask modeling, we need to tokenize the sEEG patches into discrete tokens. We introduce vector-quantized neural signal regression, which is trained by reconstructing the original sEEG signals, as shown in Figure 3. The key components are the neural tokenizer, which encodes the raw sEEG samples into embeddings, and the neural regressor, which reconstructs the original sEEG signals. The idea is inspired by VQ-VAE van2017neural , which encodes images into discrete latent embeddings.

[Figure 3]
Neural Tokenizer.

We define a neural codebook $\mathcal{C} = \{\bm{c}_i \mid i = 1, \dots, N_{codex}\} \in \mathbb{R}^{N_{codex} \times d_{codex}}$, where $N_{codex}$ is the number of discrete neural embeddings and $d_{codex}$ is the dimension of each embedding. Given a sEEG sample $\mathcal{X}$, the neural tokenizer (i.e., the neural Transformer illustrated in Figure 2) first encodes it into embeddings $\mathcal{E} = \{\bm{e}_i \in \mathbb{R}^d \mid i = 1, \dots, N\}$. After that, we utilize a linear projection $\bm{\mathrm{z}}_c$ to get the mapped embeddings $\bm{\mathrm{z}}_c(\mathcal{E}) = \{\bm{\mathrm{z}}_c(\bm{e}_i) \in \mathbb{R}^{d_{codex}} \mid i = 1, \dots, N\}$ in the codebook space. Then, the codebook looks up the nearest neighbor of each embedding $\bm{\mathrm{z}}_c(\bm{e}_i)$ in the neural codebook $\mathcal{C}$. This procedure can be formulated as

\bm{\mathrm{z}}_q(\mathcal{E}) = \{\bm{\mathrm{z}}_q(\bm{e}_i) \mid i = 1, \dots, N\}, \quad \bm{\mathrm{z}}_q(\bm{e}_i) = \bm{c}_{z_i}, \quad z_i = \mathop{\arg\min}\limits_{j} \|\ell_2(\bm{\mathrm{z}}_c(\bm{e}_i)) - \ell_2(\bm{c}_j)\|_2, \qquad (4)

where $\ell_2$ represents $\ell_2$ normalization and $\bm{\mathrm{z}}_q(\bm{e}_i)$ is the quantized vector after the quantizer. This is equivalent to finding the closest neural embedding by cosine similarity, and such $\ell_2$ normalization improves codebook utilization peng2022beitv2 .
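The nearest-neighbor lookup in Equation 4 can be sketched as follows; since both the embeddings and the codebook entries are $\ell_2$-normalized, the argmin over Euclidean distances reduces to an argmax over cosine similarities. Variable names are ours.

```python
import torch
import torch.nn.functional as F

def quantize(z_c: torch.Tensor, codebook: torch.Tensor):
    """Map projected embeddings to their nearest codebook entries (Eq. 4).

    z_c:      (N, d_codex) projected patch embeddings
    codebook: (N_codex, d_codex) learnable codebook
    Returns the quantized vectors (N, d_codex) and their indices (N,).
    """
    z_n = F.normalize(z_c, dim=-1)        # l2-normalize embeddings
    c_n = F.normalize(codebook, dim=-1)   # l2-normalize codebook entries
    # For unit vectors, minimizing Euclidean distance is equivalent to
    # maximizing cosine similarity, so an argmax over z_n @ c_n^T suffices.
    indices = (z_n @ c_n.t()).argmax(dim=-1)   # (N,)
    z_q = codebook[indices]                    # nearest codebook vectors
    return z_q, indices
```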

Neural Regressor.

The neural regressor consists of a Transformer decoder and a stack of transposed convolution layers. Given a sequence of vector-quantized embeddings $\mathcal{Z} = \{\bm{z}_i \mid i = 1, \dots, N\}$, the neural regressor converts these discrete embeddings back into raw sEEG signals $\tilde{\mathcal{X}} = \{\tilde{\bm{x}}_i \mid i = 1, \dots, N\}$. The mean squared error (MSE) loss is utilized to guide the regression. The total loss for training the neural tokenizer (i.e., the Du-IN VQ-VAE model) is defined as:

\mathcal{L}_{vqvae} = \sum_{i=1}^{N} \Big[ \|\tilde{\bm{x}}_i - \bm{x}_i\|_2^2 + \|\bm{\mathrm{sg}}[\bm{\mathrm{z}}_c(\bm{e}_i)] - \bm{\mathrm{z}}_q(\bm{e}_i)\|_2^2 + \beta \|\bm{\mathrm{z}}_c(\bm{e}_i) - \bm{\mathrm{sg}}[\bm{\mathrm{z}}_q(\bm{e}_i)]\|_2^2 \Big], \qquad (5)

where $\bm{\mathrm{sg}}$ represents the stop-gradient operation, which is the identity at the forward pass and has zero gradients. To stabilize the codebook update, we use the exponential moving average strategy van2017neural .
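A sketch of the objective in Equation 5, using detach() as the stop-gradient; the commitment weight beta=0.25 is the common VQ-VAE default and is an assumption here.

```python
import torch
import torch.nn.functional as F

def vqvae_loss(x_rec: torch.Tensor, x: torch.Tensor,
               z_c: torch.Tensor, z_q: torch.Tensor,
               beta: float = 0.25) -> torch.Tensor:
    """Reconstruction + codebook + commitment terms of Eq. 5.

    x_rec, x: reconstructed and original sEEG patches
    z_c, z_q: projected embeddings and their quantized counterparts
    """
    recon = F.mse_loss(x_rec, x)                 # ||x~ - x||^2
    codebook = F.mse_loss(z_q, z_c.detach())     # ||sg[z_c] - z_q||^2
    commit = F.mse_loss(z_c, z_q.detach())       # ||z_c - sg[z_q]||^2
    return recon + codebook + beta * commit

# In practice the decoder receives the quantized vectors through a
# straight-through estimator, e.g. z_st = z_c + (z_q - z_c).detach(),
# and with EMA codebook updates the middle term is typically dropped.
```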

3.4 Pre-training Du-IN

Masked sEEG Modeling.

To encourage Du-IN to learn contextual representations, we propose masked sEEG modeling. The whole procedure is presented in Figure 3. As illustrated in Figure 2, given a sEEG sample $\mathcal{X}$, the spatial encoder first transforms it into patch embeddings $\mathcal{E}_p = \{\bm{e}^p_i \mid i = 1, \dots, N\}$. Around 50% of these patch embeddings are patch-wisely chosen and masked; the set of masked positions is denoted as $\mathcal{M}$. Then, a shared learnable embedding $\bm{e}_{[M]} \in \mathbb{R}^d$ is used to replace the original patch embeddings:

\mathcal{E}_m = \{\bm{e}^m_i \mid i = 1, \dots, N\}, \quad \bm{e}^m_i = \delta(i \in \mathcal{M}) \odot \bm{e}_{[M]} + (1 - \delta(i \in \mathcal{M})) \odot \bm{e}^p_i, \qquad (6)

where $\delta(\cdot)$ is the indicator function. After that, the masked embeddings $\mathcal{E}_m$ are summed with the temporal embeddings and then fed into the Transformer encoder. The output embeddings $\mathcal{E}$ are used to predict the corresponding neural tokens through a linear classifier:

p(z_i \mid \bm{e}_i) = \mathrm{softmax}(\mathrm{Linear}(\bm{e}_i)). \qquad (7)

The training loss of mask modeling is defined as:

\mathcal{L}_{\mathcal{M}} = -\sum_{i \in \mathcal{M}} m_i \odot \log p(z_i \mid \bm{e}_i). \qquad (8)
Symmetric Masking.

Inspired by LaBraM jiang2024large , we further introduce a symmetric masking strategy to improve training efficiency. We calculate the inverse of the generated mask $\mathcal{M}$, obtaining $\hat{\mathcal{M}}$. Similarly, we use the new mask $\hat{\mathcal{M}}$ to perform mask modeling, obtaining the mask modeling loss $\mathcal{L}_{\mathcal{M}}^{sym}$. The total loss for pre-training the Du-IN model (i.e., the Du-IN MAE model) is defined as:

\mathcal{L}_{mae} = \mathcal{L}_{\mathcal{M}} + \mathcal{L}_{\mathcal{M}}^{sym}. \qquad (9)
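The masking step and the symmetric-mask objective (Equations 6-9) can be sketched as follows; the 50% masking ratio follows the text, while the function names and the assumption that temporal embeddings are added inside the encoder are ours.

```python
import torch
import torch.nn.functional as F

def masked_prediction_loss(patch_emb, mask, mask_token, encoder, head, targets):
    """One masked-modeling pass (Eqs. 6-8).

    patch_emb:  (B, N, d) patch embeddings from the spatial encoder
    mask:       (B, N) boolean, True where a patch is masked
    mask_token: (d,) shared learnable embedding e_[M]
    targets:    (B, N) discrete codex indices from the frozen neural tokenizer
    `encoder` is assumed to add the temporal embeddings internally.
    """
    e_m = torch.where(mask.unsqueeze(-1), mask_token, patch_emb)   # Eq. 6
    logits = head(encoder(e_m))                                    # (B, N, N_codex)
    return F.cross_entropy(logits[mask], targets[mask])            # Eq. 8, masked positions only

def du_in_mae_loss(patch_emb, mask_token, encoder, head, targets, ratio=0.5):
    """Total pre-training loss with the symmetric mask (Eq. 9)."""
    B, N, _ = patch_emb.shape
    mask = torch.rand(B, N, device=patch_emb.device) < ratio       # ~50% of patches masked
    loss = masked_prediction_loss(patch_emb, mask, mask_token, encoder, head, targets)
    loss_sym = masked_prediction_loss(patch_emb, ~mask, mask_token, encoder, head, targets)
    return loss + loss_sym
```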

4 Experiments

4.1 Dataset

Due to the lack of open-source sEEG datasets related to language tasks, we follow the experimental design outlined by Moses et al. moses2021neuroprosthesis to collect a well-annotated Chinese word-reading sEEG dataset (vocal production), including 12 subjects. Each subject underwent a surgical procedure to implant 7 to 13 invasive sEEG electrodes, yielding 72 to 158 channels per subject. For each subject, the dataset contains 15 hours of 2000 Hz recordings, 3 hours of which are task recordings.

Pre-training dataset.

For each subject, the pre-training dataset contains all sEEG recordings (about 54 million timestamps) of that subject. To stabilize computing resource usage, the time length of each sEEG sample $\mathcal{X}$ is set to 4 seconds.

Downstream dataset.

For each subject, 3 hours of the sEEG recordings are task recordings. The sEEG signals are segmented into about 3000 3-second samples, each of which is paired with the corresponding word label (from 61 pre-determined words).

4.2 Implementation Details

Preprocess.

We first band-pass filter the sEEG signals between 0.5 Hz and 200 Hz to remove low-frequency noise. Then, a 50 Hz notch filter is applied to avoid power-line interference. After that, all sEEG signals are resampled to 1000 Hz and bi-polar re-referenced li2018optimal . Finally, we perform z-score normalization on each channel to guarantee normalized data scales across all channels.
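A sketch of this preprocessing chain with SciPy; the filter orders and the adjacent-contact bipolar pairing are assumptions, and in practice such steps are often performed with dedicated toolboxes (e.g., MNE).

```python
import numpy as np
from scipy.signal import butter, iirnotch, filtfilt, resample_poly

def preprocess(seeg: np.ndarray, fs: int = 2000, fs_out: int = 1000) -> np.ndarray:
    """Band-pass, notch, resample, bipolar re-reference, and z-score.

    seeg: (C, T) raw recording at `fs` Hz; returns (C-1, T') at `fs_out` Hz.
    The simple adjacent-contact bipolar scheme here is an assumption.
    """
    # 0.5-200 Hz band-pass (4th-order Butterworth, zero-phase).
    b, a = butter(4, [0.5, 200.0], btype="bandpass", fs=fs)
    x = filtfilt(b, a, seeg, axis=-1)
    # 50 Hz notch to suppress power-line interference.
    b_n, a_n = iirnotch(w0=50.0, Q=30.0, fs=fs)
    x = filtfilt(b_n, a_n, x, axis=-1)
    # Resample 2000 Hz -> 1000 Hz.
    x = resample_poly(x, up=fs_out, down=fs, axis=-1)
    # Bipolar re-reference between adjacent contacts.
    x = x[:-1] - x[1:]
    # Per-channel z-score normalization.
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + 1e-8)
```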

Model Configurations.

The length of each sEEG patch is 100 ms, resulting in 40 patches per sample in the pre-training dataset and 30 patches per sample in the downstream dataset. The "Spatial Encoder" contains one linear projection and three 1-D convolution layers, transforming the original sEEG patches into patch embeddings with $d = 160$. The following "Transformer Encoder" contains an 8-layer Transformer encoder with model dimension $d = 160$, inner (FFN) dimension $d_{ff} = 320$, and 8 attention heads. See Appendix C for more details.

Pre-training.

During pre-training, we use either all sEEG recordings (15 hours) or the sEEG recordings without task recordings (12 hours) to train the Du-IN VQ-VAE and Du-IN MAE models. To enhance the robustness of the learned codexes and representations, we further use the data augmentation described in Appendix D. For each subject, the models are pre-trained on a Linux system with 2 CPUs (Intel Xeon Gold 6230 40-Core Processor) and 1 GPU (NVIDIA Tesla V100 32GB) for ~1.2 days.

Fine-tuning.

During the downstream evaluation, we split the task recordings into training, validation, and testing splits in an approximate 80%/10%/10% ratio. We also use the data augmentation described in Appendix D to make the most of the gathered dataset. We employ cross-entropy loss (multi-class classification) as the training loss. Our experiments are conducted on one V100 GPU with Python 3.11.7 and PyTorch 2.1.2 + CUDA 12.3. The best models are trained on the training set, selected on the validation set, and finally evaluated on the test set. For model comparison, we report the average and standard error values (across all subjects) over six different random seeds to obtain comparable results. For the subject-wise evaluation, we report the average and standard deviation values (of each subject) in Appendix I.
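A minimal fine-tuning loop consistent with this setup (cross-entropy over the 61 classes, model selection on the validation split); the optimizer, learning rate, and epoch count are assumptions.

```python
import torch
import torch.nn.functional as F

def finetune(model, train_loader, val_loader, epochs=50, lr=1e-4, device="cuda"):
    """Fine-tune a pre-trained encoder plus classification head on 61-word labels."""
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # optimizer/lr are assumptions
    best_acc, best_state = -1.0, None
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:              # x: (B, C, T), y: (B,)
            logits = model(x.to(device))       # (B, 61)
            loss = F.cross_entropy(logits, y.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Model selection on the validation split.
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for x, y in val_loader:
                pred = model(x.to(device)).argmax(dim=-1).cpu()
                correct += (pred == y).sum().item()
                total += y.numel()
        if correct / total > best_acc:
            best_acc = correct / total
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
    model.load_state_dict(best_state)
    return model, best_acc
```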

4.3 Channel Contribution and Selection

As demonstrated in Section 2.1, previous neuroscience studies reveal that vocal production predominantly engages specific brain regions. Given the sparse distribution of implanted sEEG electrodes (each containing 8-16 channels), it is vital to exclude redundant electrodes unrelated to vocal production, thus improving decoding performance. We retain electrodes implanted in relevant brain regions and evaluate the performance based on the remaining electrodes. Table 1 shows that excluding approximately 85% of the electrodes leads to a dramatic increase in decoding performance.

Methods | # of Channels (Averaged) | Accuracy (%) ± Ste (%)
Du-IN (w/o electrode selection) | 109.75 | 30.12 ± 5.64
Du-IN (w/ electrode selection) | 12.25 | 55.92 ± 4.96

To further understand the detailed contribution of each channel, we analyze the weights of the linear projection in the spatial encoder. In detail, we calculate the contribution scores of channels per subject and organize them accordingly, as described in Appendix H. Figure 4 demonstrates that (1) the brain regions effective for speech decoding align with findings from previous neuroscience research, and (2) our model achieves optimal decoding performance with approximately 10 channels, 80% of which originate from the same electrode. For simplicity, we utilize these top 10 channels for both pre-training and downstream evaluation.
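One simple proxy for such contribution scores is the per-channel weight norm of the spatial encoder's linear projection, sketched below; the exact procedure is described in Appendix H, so this should be read only as an illustrative approximation.

```python
import torch

def channel_contribution_scores(proj_weight: torch.Tensor) -> torch.Tensor:
    """Score each input channel by the l2 norm of its projection weights.

    proj_weight: (d_model, n_channels) weight of the spatial encoder's
    linear projection. Returns normalized scores of shape (n_channels,).
    This proxy may differ from the procedure in Appendix H.
    """
    scores = proj_weight.norm(dim=0)   # l2 norm over output dimensions
    return scores / scores.sum()

# Usage: pick the top-10 channels for pre-training and downstream evaluation.
#   top10 = channel_contribution_scores(encoder.proj.weight).topk(10).indices
```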

[Figure 4]

4.4 Comparison with Other Models

Table 2 presents the results of our Du-IN model, previous sEEG self-supervised models, and other supervised baselines. The results demonstrate that our Du-IN model outperforms all baselines. Notably, the models that build univariate representations based on a single channel (i.e., BrainBERT, Brant) perform worse than the models building multi-variate representations, challenging the generalizability of current Transformer-based strategies for modeling spatial relationships among channels.

As the BrainBERT model does not consider the spatial relationships among channels, we mainly focus on understanding why Brant fails to effectively capture the spatial relationships in the speech decoding task. Compared to epilepsy detection, speech decoding is a more difficult task, demanding intricate processing in specific brain regions. However, unlike LaBraM jiang2024large , Brant does not introduce spatial embeddings to identify the spatial location of each channel. Since the electrodes are sparsely distributed in the brain and the raw sEEG signals on the same electrode are highly correlated, it is fairly easy to identify their spatial relationships from their values. As demonstrated in iTransformer liu2023itransformer , this modeling approach is well suited for detecting time-delay events, e.g., epilepsy detection.

For speech decoding tasks, sEEG often requires bi-polar re-referencing (or Laplacian re-referencing) to remove the high correlations among channels, thus avoiding model overfitting wang2023brainbert . Once the correlations among channels have been removed, Brant loses the ability to model spatial relationships among channels. Meanwhile, Brant only involves a few sEEG location configurations (i.e., 10 subjects) during its pre-training stage; with their method, more sEEG location configurations would likely be required to learn generic relationships among channels.

Methods | Pre-trained | Model Size | Accuracy (%) ± Ste (%)
EEG-Conformer song2022eeg | ✗ | 2.34M | 45.82 ± 4.66
CNN-BiGRU moses2021neuroprosthesis | ✗ | 0.54M | 32.04 ± 5.45
BrainBERT wang2023brainbert | ✓ | 43.58M | 7.50 ± 1.76
Brant zhang2024brant | ✓ | 69.35M | 12.42 ± 4.10
Du-IN | ✗ | 4.38M | 56.29 ± 5.20
Du-IN (vqvae+vq) | ✓ | 4.38M | 44.17 ± 4.04
Du-IN (vqvae) | ✓ | 4.38M | 58.24 ± 4.83
Du-IN (mae) | ✓ | 4.38M | 62.70 ± 4.69

4.5 Ablation Study

Self-Supervision Initialization.

As illustrated in Figure 3, the Du-IN model entails a two-stage pre-training process, wherein both the Du-IN VQ-VAE model and the Du-IN MAE model are trained. Previous studies utilize different strategies duan2023dewave ; chen2024eegformer ; jiang2024large to leverage these pre-trained models to enhance the performance of downstream tasks. Here, we evaluate these different strategies for comparison; see Appendix C.3 for detailed definitions. Table 2 shows that initializing weights from the Du-IN MAE model captures contextual embeddings effectively, resulting in the highest decoding performance.

Pre-training with/without Downstream Datasets.

During the pre-training stage, we hope that the Du-IN VQ-VAE model can extract general tokens of that brain region, thus guiding the Du-IN MAE model to learn general representations that are not specific to any particular task. Although no labeled data is used during the pre-training stage, to eliminate the influence of the pre-training data on downstream tasks, we compare the results with and without incorporating the downstream task dataset into the pre-training stage. Table 3 shows a slight performance drop when excluding the downstream dataset. However, the decoding performance is still higher than the baseline performance without pre-training, suggesting that the degradation is mainly due to the reduced size of the pre-training dataset. We expect that, with more task-free recordings for pre-training, our model can achieve better decoding performance.

Methods | Pre-training Dataset Size | Accuracy (%) ± Ste (%)
Du-IN (mae w/o downstream dataset) | 12 hours per subject | 60.02 ± 4.34
Du-IN (mae w/ downstream dataset) | 15 hours per subject | 62.70 ± 4.69
Discrete Codex.

During the neural tokenizer training stage, the Du-IN VQ-VAE model encodes sEEG patches into discrete codexes and then reconstructs the original signal from these codexes. We evaluate performance across varying codex sizes (512 to 8192) to ascertain whether codex size affects the quality of the learned codebook. As illustrated in Figure 5, while an extremely small codex size lacks representation diversity, an extremely large codex size often leads to codebook collapse. We suspect that our existing training data might not be adequate for larger codex sizes. Furthermore, our experiments suggest that the model performs optimally when the codex dimension, $d_{codex} = 64$, is slightly smaller than the model dimension, $d = 160$, yielding a more effective regularization effect.

Perception Time Window.

We also conduct an ablation study on the structure of the spatial encoder described in Section 3.2. As the spatial encoder transforms the sEEG signals within a given patch into a patch embedding, it compresses the sEEG signals for perception. As described in Section 4.2, the model utilizes a perception field of 100 ms. We conduct an ablation study over different perception fields and report it in Figure 5. The model performance notably drops with a perception field smaller than 60 ms and gradually declines as the perception field exceeds 160 ms, with a small peak around 100 ms to 140 ms. We consider this phenomenon reasonable, since sEEG is known for its ability to precisely capture the rapid dynamics of specific brain regions.

[Figure 5]

5 Limitations

Despite Du-IN's enhancements in speech decoding via discrete codebook-guided mask modeling, it is still restricted to closed-set speech decoding tasks (i.e., the word set only includes 61 pre-determined words). However, in work parallel to ours, Feng et al. feng2023high build an acoustic-inspired model that can decode arbitrary Chinese words by predicting syllable components (initials, finals, tones). Although their method requires a large amount of labeled data, their experimental design closely mirrors ours; the difference lies in requiring the subject to repeat syllable components instead of entire words. Therefore, with slight modifications, our model can support open-set speech decoding tasks.

Additionally, the experiments in this paper are restricted to the vocal production part of language decoding, i.e., speech decoding. A more interesting but difficult task is to decode language at the semantic level, where large language models have been widely used to improve model performance tang2023semantic ; duan2023dewave . However, due to the locality of sEEG recordings, it remains under exploration whether sEEG recordings can fully capture semantic-related information across brain regions.

6 Conclusion

This paper proposes Du-IN, a framework for speech decoding, which learns contextual embeddings through discrete codebook-guided mask modeling on specific brain regions. To evaluate our model, we collect a well-annotated Chinese word-reading sEEG dataset, addressing the lack of sEEG language datasets. Inspired by neuroscientific findings, we analyze the brain regions effective for speech decoding and achieve the best decoding performance with about one electrode in specific brain regions, which dovetails with past neuroscientific research on language. Comprehensive experiments demonstrate that our model outperforms both supervised and sEEG-based self-supervised baselines, effectively capturing the intricate processing within specific brain regions. It marks a promising neuro-inspired AI approach in BCI. Finally, we hope our work can inform future developments in sEEG-based self-supervised models, with more consideration of how to build the basic representation units so that the model can maximally benefit from the pre-training stage.

Ethics Statement

Experiments that contribute to this work were approved by the IRB. All subjects consented to participate. All electrode locations were exclusively dictated by clinical considerations.

Reproducibility Statement

Code to train models and reproduce the results is submitted as part of the supplementary materials and can be accessed here: TODO, including a demo dataset of 3 subjects for downstream fine-tuning.

References

  • (1) Miguel Angrick, Maarten C. Ottenhoff, Lorenz Diener, Darius Ivucic, Gabriel Ivucic, Sophocles Goulis, Jeremy Saal, Albert J. Colon, Louis Wagner, Dean J. Krusienski, et al. Real-time synthesis of imagined speech processes from minimally invasive recordings of neural activity. Communications Biology, 4(1):1055, 2021.
  • (2) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • (3) Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
  • (4) Jeffrey R. Binder, Julie A. Frost, Thomas A. Hammeke, Robert W. Cox, Stephen M. Rao, and Thomas Prieto. Human brain language areas identified by functional magnetic resonance imaging. Journal of Neuroscience, 17(1):353–362, 1997.
  • (5) Kristofer E. Bouchard, Nima Mesgarani, Keith Johnson, and Edward F. Chang. Functional organization of human sensorimotor cortex for speech articulation. Nature, 495(7441):327–332, 2013.
  • (6) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
  • (7) Josh Chartier, Gopala K. Anumanchipalli, Keith Johnson, and Edward F. Chang. Encoding of articulatory kinematic trajectories in human speech sensorimotor cortex. Neuron, 98(5):1042–1054, 2018.
  • (8) Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, pages 1691–1703. PMLR, 2020.
  • (9) Yuqi Chen, Kan Ren, Kaitao Song, Yansen Wang, Yifan Wang, Dongsheng Li, and Lili Qiu. EEGFormer: Towards transferable and interpretable large-scale EEG foundation model. arXiv preprint arXiv:2401.10278, 2024.
  • (10) Cheol Jun Cho, Edward Chang, and Gopala Anumanchipalli. Neural latent aligner: Cross-trial alignment for learning representations of complex, naturalistic neural data. In International Conference on Machine Learning, pages 5661–5676. PMLR, 2023.
  • (11) Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pages 7480–7512. PMLR, 2023.
  • (12) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • (13) Benjamin K. Dichter, Jonathan D. Breshears, Matthew K. Leonard, and Edward F. Chang. The control of vocal pitch in human laryngeal motor cortex. Cell, 174(1):21–31, 2018.
  • (14) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • (15) Yiqun Duan, Jinzhao Zhou, Zhen Wang, Yu-Kai Wang, and Chin-Teng Lin. DeWave: Discrete EEG waves encoding for brain dynamics to text translation. arXiv preprint arXiv:2309.14030, 2023.
  • (16) Chen Feng, Lu Cao, Di Wu, En Zhang, Ting Wang, Xiaowei Jiang, Heng Ding, Chenhao Zhou, Jinbo Chen, Hui Wu, et al. A high-performance brain-sentence communication designed for logosyllabic language. bioRxiv, pages 2023–11, 2023.
  • (17) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
  • (18) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015.
  • (19) Weibang Jiang, Liming Zhao, and Bao-Liang Lu. Large brain model for learning generic representations with tremendous EEG data in BCI. In The Twelfth International Conference on Learning Representations, 2024.
  • (20) Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
  • (21) Demetres Kostas, Stephane Aroca-Ouellette, and Frank Rudzicz. BENDR: Using transformers and a contrastive self-supervised learning task to learn from massive amounts of EEG data. Frontiers in Human Neuroscience, 15:653659, 2021.
  • (22) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
  • (23) Guangye Li, Shize Jiang, Sivylla E. Paraskevopoulou, Meng Wang, Yang Xu, Zehan Wu, Liang Chen, Dingguo Zhang, and Gerwin Schalk. Optimal referencing for stereo-electroencephalographic (sEEG) recordings. NeuroImage, 183:327–335, 2018.
  • (24) Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. iTransformer: Inverted transformers are effective for time series forecasting. arXiv preprint arXiv:2310.06625, 2023.
  • (25) Yuchen Liu and Ziyu Jia. BSTT: A Bayesian spatial-temporal transformer for sleep staging. In The Eleventh International Conference on Learning Representations, 2022.
  • (26) Sean L. Metzger, Kaylo T. Littlejohn, Alexander B. Silva, David A. Moses, Margaret P. Seaton, Ran Wang, Maximilian E. Dougherty, Jessie R. Liu, Peter Wu, Michael A. Berger, et al. A high-performance neuroprosthesis for speech decoding and avatar control. Nature, 620(7976):1037–1046, 2023.
  • (27) David A. Moses, Sean L. Metzger, Jessie R. Liu, Gopala K. Anumanchipalli, Joseph G. Makin, Pengfei F. Sun, Josh Chartier, Maximilian E. Dougherty, Patricia M. Liu, Gary M. Abrams, et al. Neuroprosthesis for decoding speech in a paralyzed person with anarthria. New England Journal of Medicine, 385(3):217–227, 2021.
  • (28) Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, and Furu Wei. BEiT v2: Masked image modeling with vector-quantized visual tokenizers. 2022.
  • (29) Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
  • (30) Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. Advances in Neural Information Processing Systems, 32, 2019.
  • (31) Blake A. Richards, Timothy P. Lillicrap, Philippe Beaudoin, Yoshua Bengio, Rafal Bogacz, Amelia Christensen, Claudia Clopath, Rui Ponte Costa, Archy de Berker, Surya Ganguli, et al. A deep learning framework for neuroscience. Nature Neuroscience, 22(11):1761–1770, 2019.
  • (32) Andrew Saxe, Stephanie Nelli, and Christopher Summerfield. If deep learning is the answer, what is the question? Nature Reviews Neuroscience, 22(1):55–67, 2021.
  • (33) Jingwei Sheng, Li Zheng, Bingjiang Lyu, Zhehang Cen, Lang Qin, Li Hai Tan, Ming-Xiong Huang, Nai Ding, and Jia-Hong Gao. The cortical maps of hierarchical linguistic structures during speech perception. Cerebral Cortex, 29(8):3232–3240, 2019.
  • (34) Yonghao Song, Qingqing Zheng, Bingchuan Liu, and Xiaorong Gao. EEG Conformer: Convolutional transformer for EEG decoding and visualization. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 31:710–719, 2022.
  • (35) Jerry Tang, Amanda LeBel, Shailee Jain, and Alexander G. Huth. Semantic reconstruction of continuous language from non-invasive brain recordings. Nature Neuroscience, 26(5):858–866, 2023.
  • (36) Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017.
  • (37) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  • (38) Christopher Wang, Vighnesh Subramaniam, Adam Uri Yaari, Gabriel Kreiman, Boris Katz, Ignacio Cases, and Andrei Barbu. BrainBERT: Self-supervised representation learning for intracranial recordings. arXiv preprint arXiv:2302.14367, 2023.
  • (39) Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. TimesNet: Temporal 2D-variation modeling for general time series analysis. In The Eleventh International Conference on Learning Representations, 2022.
  • (40) Ke Yi, Yansen Wang, Kan Ren, and Dongsheng Li. Learning topology-agnostic EEG representations with geometry-aware modeling. Advances in Neural Information Processing Systems, 36, 2024.
  • (41) Daoze Zhang, Zhizhang Yuan, Yang Yang, Junru Chen, Jingjing Wang, and Yafeng Li. Brant: Foundation model for intracranial neural signal. Advances in Neural Information Processing Systems, 36, 2024.
  • (42) Hui Zheng, Zhongtao Chen, Haiteng Wang, Jianyang Zhou, Lin Zheng, and Yunzhe Liu. Universal sleep decoder: Aligning awake and sleep neural representation across subjects. arXiv preprint arXiv:2309.16457, 2023.

Appendix A Experiment Design

[Figure 6]

Due to the lack of open-source sEEG datasets related to language tasks, we follow the experimental design outlined by Moses et al. [27] to collect a well-annotated Chinese word-reading sEEG dataset, including 12 subjects (9 male, 3 female; aged 15-53, mean 27.8, SD 10.4) with pharmacologically intractable epilepsy.

In the word-reading task, the subject speaks aloud individual words from a 61-word set while we simultaneously record their brain activity (measured by sEEG) and voice. The word set is chosen based on the following criteria:

  • The versatility of the words in generating a range of sentences.

  • The simplicity of using the words to express fundamental caregiving requirements.

  • The diversity of word pronunciations to cover as many Chinese pronunciation combinations as possible.

A list of the words contained in this 61-word set is provided in Table 4.

All data are collected as a series of "blocks" (25 blocks in total), with each block lasting about 10 minutes and consisting of multiple trials. During each block of this task, all words (from the 61-word set) are presented individually twice, leading to a total of 122 trials.

Each trial in a block of this task starts with one word shown on the screen in white text. After 0.5 seconds, the text will turn green and remain on the screen for 2 seconds. This color transition from white to green represents the go cue for each trial, and the subject is instructed to speak the word aloud as soon as the text turns green. Afterward, the text will be replaced with a blank screen with a centered cross. After 0.5 seconds, the task continues to the next trial. The word presentation order is randomized within each task block.

Besides, we also collected non-task recordings of subjects in their daily life. Apart from sleep periods, there are roughly 12 hours of non-task recordings during wakefulness. In summary, for each subject, we collect about 15 hours of sEEG recordings, of which 3 hours are task recordings.

Words | Translations | Words | Translations | Words | Translations
嘴巴 | mouth | 菠萝 | pineapple | 帮助 | help
把 | get | 朋友 | friend | 脸盆 | washbasin
平静 | calm | 漂亮 | pretty | 衣服 | clothes
豆腐 | tofu | 米饭 | rice | 放在 | put on
面条 | noodle | 毛巾 | towel | 关门 | close the door
电脑 | computer | 凳子 | stool | 小刀 | knife
头疼 | headache | 软糖 | gummies | 醋 | vinegar
青菜 | vegetables | 厕所 | toilet | 葱花 | chopped green onion
手机 | cell phone | 篮球 | basketball | 钢琴 | piano
心情 | mood | 丝瓜 | loofah | 蒜泥 | garlic paste
怎样 | how | 香肠 | sausage | 需要 | need
你 | you | 拿 | hold | 橙汁 | orange juice
找 | look for | 猪肉 | pork | 吃 | eat
穿 | wear | 是 | be | 家人 | family
热水 | hot water | 护士 | nurse | 换药 | change dressing
喝 | drink | 口渴 | thirsty | 看 | look
碗 | bowl | 鱼块 | steak | 感觉 | feel
给 | give | 玩 | play | 问题 | problem
外卖 | takeouts | 有 | have | 音乐 | music
预约 | reserve | 汤圆 | sweet dumpling | 愿意 | willing
我 | I | – | – | – | –

Appendix B Details of Baselines

In experiments, we compare our model to the existing supervised or self-supervised methods on brain signals. The details of these baseline models are given here:

  • EEG-Conformer [34]: A supervised model that combines a CNN module and a Transformer module to encapsulate local and global features in a unified EEG classification framework. EEG-Conformer is mainly designed for EEG-based motor imagery tasks. Since the data modalities of EEG and sEEG are similar, and our signals likewise primarily pertain to vocal production, EEG-Conformer is suitable to serve as a baseline for comparison.

  • CNN-BiGRU [27]: A supervised model that combines a CNN module and a Bi-GRU module to capture contextual features from neural signals. This model is mainly designed for ECoG-based vocal-production tasks, similar to ours. Since ECoG and sEEG are both intracranial neural recordings, this model is suitable to serve as a baseline for comparison.

  • BrainBERT [38]: A self-supervised model for sEEG recordings that bridges modern representation learning approaches and neuroscience. BrainBERT builds universal representations based on the superlet spectrogram of a single sEEG channel, without modeling the spatial relationships among channels. Since the downstream tasks for BrainBERT are also related to language decoding (e.g., sentence-onset detection, speech vs. non-speech detection), BrainBERT is suitable to serve as a baseline for comparison.

  • Brant [41]: A self-supervised model for sEEG recordings that captures both long-term temporal dependency and spatial correlation from neural signals. Brant is mainly designed for medical applications, serving as an sEEG foundation model. Although Brant mainly evaluates its performance on low-level modeling tasks [39] (e.g., neural signal forecasting, imputation), it achieves SOTA performance on some high-level modeling tasks (e.g., seizure detection). As a foundation model in the sEEG pre-training field, Brant is suitable to serve as a baseline for comparison.

When evaluating the decoding performance of these baseline models, we follow the same experimental setup as the Du-IN CLS model:

  • For each subject, we split the downstream dataset into training, validation, and testing splits of roughly 80%, 10%, and 10%.

  • Each data sample is 3 seconds long, at the sampling rate specified for each model.

  • The samples in the training set are augmented following the pipeline defined in Appendix D.

For the self-supervised methods, the pre-training setup follows the original setup of each model:

  • For the BrainBERT model, we use around 180 hours of sEEG recordings from all 12 subjects for pre-training. This pre-training dataset is larger than the one (approximately 45 hours) used in the original paper.

  • For the Brant model, we also use all sEEG recordings from 12 subjects to pre-train it. While the total pre-training dataset is smaller than the one (around 2700 hours) used in the original paper, the number of subjects (i.e., the number of sEEG location configurations) is greater than in the original paper.

Appendix C Model Details

C.1 Du-IN VQ-VAE

The architecture of the Du-IN VQ-VAE model contains three parts: (1) Neural Tokenizer, (2) Vector Quantizer, and (3) Neural Regressor. The overall architecture of the "Neural Tokenizer" is shown in Figure 2. The "Vector Quantizer" is implemented similarly to that in LaBraM [19]. The "Neural Regressor" contains:

  • Transformer Decoder: A stack of Transformer layers.

  • Time Regression Head: A stack of 1D Transposed Convolution layers and one linear projection layer.

The hyperparameters for Du-IN VQ-VAE training are shown in Table 5.

Table 5: Hyperparameters for Du-IN VQ-VAE training.
Module | Sub-Module | Name | Value
Neural Tokenizer (Neural Transformer) | Spatial Encoder | Linear Projection | 10 → 16
 | | # of Input Channels | {16, 128, 128}
 | | # of Output Channels | {128, 128, 16}
 | | Kernel Size | {19, 3, 3}
 | | Stride | {10, 1, 1}
 | | Padding | {9, 1, 1}
 | Transformer Encoder | # of Transformer Layers | 8
 | | Hidden Size | 160
 | | MLP Size | 320
 | | MLP Dropout Ratio | {0.2, 0.}
 | | # of Attention Heads | 8
 | | Attention Head Size | 64
 | | Attention Dropout Ratio | 0.2
Vector Quantizer | - | Codex Size | 2048 × 64
 | | Embedding-to-Codex Projection | 160 → 160 (Tanh) → 64
 | | Codex-to-Embedding Projection | 64 → 160
Neural Regressor | Transformer Decoder | # of Transformer Layers | 4
 | | Hidden Size | 160
 | | MLP Size | 320
 | | MLP Dropout Ratio | {0.2, 0.}
 | | # of Attention Heads | 8
 | | Attention Head Size | 64
 | | Attention Dropout Ratio | 0.2
 | Time Regression Head | # of Input Channels | {160, 128, 128, 128, 128}
 | | # of Output Channels | {128, 128, 128, 128, 16}
 | | Kernel Size | {3, 3, 10, 9, 19}
 | | Stride | {1, 1, 10, 1, 10}
 | | Padding | -
 | | Output Padding | -
 | | Linear Projection | 16 → 10
Optimizer | - | Batch Size | 64
 | | Maximum Learning Rate | 3e-4
 | | Minimum Learning Rate | 5e-5
 | | Learning Rate Scheduler | Cosine
 | | Optimizer Type | AdamW
 | | Adam β | (0.9, 0.99)
 | | Weight Decay | 0.01
 | | Total Epochs | 400
 | | Warm-up Epochs | 40
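
As a rough illustration of how the "Time Regression Head" could be assembled from the values in Table 5, here is a minimal PyTorch sketch (not the authors' released implementation). Padding and output padding are left at framework defaults because Table 5 does not specify them, and the final 16 → 10 projection is assumed to map back to the 10 selected input channels.

```python
import torch.nn as nn

class TimeRegressionHead(nn.Module):
    """Sketch of the time regression head, using the channel/kernel/stride values from Table 5."""

    def __init__(self):
        super().__init__()
        in_chs  = [160, 128, 128, 128, 128]
        out_chs = [128, 128, 128, 128, 16]
        kernels = [3, 3, 10, 9, 19]
        strides = [1, 1, 10, 1, 10]
        # Padding / output padding left at defaults (Table 5 does not specify them).
        self.deconv = nn.Sequential(*[
            nn.ConvTranspose1d(i, o, kernel_size=k, stride=s)
            for i, o, k, s in zip(in_chs, out_chs, kernels, strides)
        ])
        # Final linear projection, 16 -> 10 (assumed to map back to the 10 input channels).
        self.proj = nn.Linear(16, 10)

    def forward(self, x):
        # x: (batch, num_tokens, 160) embeddings from the Transformer decoder.
        x = self.deconv(x.transpose(1, 2))    # -> (batch, 16, num_samples)
        return self.proj(x.transpose(1, 2))   # -> (batch, num_samples, 10)
```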

C.2 Du-IN MAE

The architecture of the Du-IN MAE model contains two parts: (1) Neural Encoder, and (2) Token Prediction Head. The overall architecture of the "Neural Encoder" is shown in Figure 2. The hyperparameters of "Neural Encoder" are the same as those of "Neural Tokenizer" in Du-IN VQ-VAE. It’s worth noting that when training Du-IN MAE, the weights of the "Neural Encoder" are randomly initialized, instead of loaded from the pre-trained Du-IN VQ-VAE model. The hyperparameters for Du-IN MAE training are shown in Table 6.

Table 6: Hyperparameters for Du-IN MAE training.
Module | Sub-Module | Name | Value
Token Prediction Head | - | Linear Projection | 160 → 2048
Optimizer | - | Batch Size | 64
 | | Maximum Learning Rate | 3e-4
 | | Minimum Learning Rate | 5e-5
 | | Learning Rate Scheduler | Cosine
 | | Optimizer Type | AdamW
 | | Adam β | (0.9, 0.99)
 | | Weight Decay | 0.05
 | | Total Epochs | 400
 | | Warm-up Epochs | 40
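
To make the mask-modeling objective concrete, the sketch below shows one plausible form of the token-prediction loss, assuming (as is common in mask-modeling setups) that the cross-entropy is computed only at masked positions against the discrete codes produced by the pre-trained Du-IN VQ-VAE; `encoder`, `head`, and the masking itself are stand-ins, not the released implementation.

```python
import torch.nn.functional as F

def mae_token_loss(encoder, head, x_masked, target_tokens, mask):
    """Masked-token prediction loss (sketch).

    encoder:       Du-IN neural encoder; maps masked sEEG to (batch, tokens, 160).
    head:          token prediction head, e.g. a Linear(160, 2048) layer as in Table 6.
    target_tokens: (batch, tokens) discrete codes from the pre-trained Du-IN VQ-VAE.
    mask:          (batch, tokens) boolean tensor, True where the input was masked.
    """
    logits = head(encoder(x_masked))                            # (batch, tokens, 2048)
    return F.cross_entropy(logits[mask], target_tokens[mask])   # loss on masked positions only
```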

C.3 Du-IN CLS

The architecture of the Du-IN CLS model contains two parts: (1) Neural Encoder, and (2) Label Prediction Head. The overall architecture of the "Neural Encoder" is shown in Figure 2. The hyperparameters of "Neural Encoder" are the same as those of "Neural Tokenizer" in Du-IN VQ-VAE. It’s worth noting that the "Neural Encoder" weights in Du-IN CLS can be loaded from either the pre-trained Du-IN MAE or the pre-trained Du-IN VQ-VAE. In the ablation experiments shown in Table 2, our models have different suffixes:

  • Du-IN: The original Du-IN CLS model. All weights of this model are randomly initialized.

  • Du-IN (vqvae+vq): The weights of the "Neural Encoder" in the Du-IN CLS model are loaded from the pre-trained Du-IN VQ-VAE. When fine-tuning it on the downstream task, the "Vector Quantizer" from the pre-trained Du-IN VQ-VAE is inserted between the "Neural Encoder" and the "Label Prediction Head". This is the same operation as in DeWave [15].

  • Du-IN (vqvae): The weights of the "Neural Encoder" in the Du-IN CLS model are loaded from the pre-trained Du-IN VQ-VAE. This is the same operation as in EEGFormer [9].

  • Du-IN (mae): The weights of the "Neural Encoder" in the Du-IN CLS model are loaded from the pre-trained Du-IN MAE. This is the same operation as in LaBraM [19].

The "Label Prediction Head" is an MLP with one hidden layer: it flattens the output embedding sequence from the upstream encoder and maps the flattened feature embedding to the final prediction. The hyperparameters for Du-IN CLS training are shown in Table 7.

Table 7: Hyperparameters for Du-IN CLS training.
Module | Sub-Module | Name | Value
Label Prediction Head | - | Flatten | -
 | | Linear Projection | 30 × 160 → 128 (ReLU) → 61
Optimizer | - | Batch Size | 32
 | | Maximum Learning Rate | 2e-4
 | | Minimum Learning Rate | 5e-6
 | | Learning Rate Scheduler | Cosine
 | | Optimizer Type | AdamW
 | | Adam β | (0.9, 0.99)
 | | Weight Decay | 0.05
 | | Total Epochs | 200
 | | Warm-up Epochs | 20
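
For clarity, a minimal PyTorch sketch of the label prediction head, following the dimensions in Table 7 (30 tokens of 160 dimensions, one hidden layer of 128 units, 61 output classes); this is an illustrative reconstruction, not the exact released code.

```python
import torch.nn as nn

# Sketch of the label prediction head (Table 7): flatten the 30 x 160 token
# embeddings, then apply an MLP with one hidden layer and 61 output classes.
label_head = nn.Sequential(
    nn.Flatten(),              # (batch, 30, 160) -> (batch, 4800)
    nn.Linear(30 * 160, 128),
    nn.ReLU(),
    nn.Linear(128, 61),        # logits over the 61-word vocabulary
)
```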

Appendix D Data Augmentation

To enhance the robustness of the learned representations during both the pre-training and fine-tuning stages, we apply data augmentation to both datasets.

Pre-training Dataset.

In our implementation, we segment sEEG recordings into 8-second samples with a 4-second overlap. When fetching a sample, we randomly select a starting point between 0 and 4 seconds, then extract a 4-second sample beginning from that point.
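
A minimal NumPy sketch of this jittering, assuming an integer sampling rate `sfreq` and a pre-cut 8-second segment; the actual data-loading code may differ.

```python
import numpy as np

def crop_pretrain_sample(segment, sfreq):
    """Randomly crop a 4-second window from a stored 8-second segment.

    segment: array of shape (n_channels, 8 * sfreq), sfreq in Hz (integer).
    """
    start = np.random.uniform(0.0, 4.0)          # random onset within the first 4 s
    i = int(round(start * sfreq))
    return segment[:, i:i + int(4 * sfreq)]
```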

Downstream Dataset.

Since a trial lasts 3 seconds, employing the jittering described above would blend in information from neighboring trials. In our implementation, we segment sEEG recordings into 3-second samples. When fetching a sample, we randomly choose a shift step between 0 and 0.3 seconds, then shift the sample either to the left or right and pad the gap with zeros.
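
A minimal NumPy sketch of this shift-and-pad augmentation, again assuming an integer sampling rate; the shift granularity in the released code may differ.

```python
import numpy as np

def shift_trial_sample(trial, sfreq, max_shift_s=0.3):
    """Randomly shift a 3-second trial left or right and zero-pad the gap.

    trial: array of shape (n_channels, 3 * sfreq), sfreq in Hz (integer).
    """
    shift = np.random.randint(0, int(max_shift_s * sfreq) + 1)
    if shift == 0:
        return trial.copy()
    out = np.zeros_like(trial)
    if np.random.rand() < 0.5:          # shift right: zeros at the beginning
        out[:, shift:] = trial[:, :-shift]
    else:                               # shift left: zeros at the end
        out[:, :-shift] = trial[:, shift:]
    return out
```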

Appendix E Du-IN Pre-training Analysis

The pre-training of Du-IN can be interpreted as the training of a variational autoencoder [20, 3]. Let $x$ denote the original sEEG signal, $\tilde{x}$ the sEEG signal corrupted by masking, and $z$ the neural tokens. Consider the evidence lower bound (ELBO) of the log-likelihood $p(x|\tilde{x})$, i.e., recovering the original sEEG signal from its corrupted version:

\[
\sum_{(x_i,\tilde{x}_i)\in\mathcal{D}} \log p(x_i \mid \tilde{x}_i) \;\geq\; \sum_{(x_i,\tilde{x}_i)\in\mathcal{D}} \Big( \underbrace{\mathbb{E}_{z_i \sim q_\phi(\mathbf{z} \mid x_i)}\big[\log p_\psi(x_i \mid z_i)\big]}_{\text{Neural Token Reconstruction}} - D_{\mathrm{KL}}\big[q_\phi(\mathbf{z} \mid x_i),\, p_\theta(\mathbf{z} \mid \tilde{x}_i)\big] \Big), \tag{10}
\]

where (1) $q_\phi(\mathbf{z}|x)$ denotes the neural tokenizer that obtains the neural tokens; (2) $p_\psi(x|z)$ decodes the original sEEG signal given the input neural tokens; (3) $p_\theta(\mathbf{z}|\tilde{x})$ recovers the neural tokens from the masked sEEG signal, which is our Du-IN pre-training task.

The whole framework is optimized through a two-stage procedure, as in [36, 30]. In the first stage, we train the neural tokenizer as a discrete variational autoencoder by minimizing the reconstruction loss $-\mathbb{E}_{z_i \sim q_\phi(\mathbf{z}|x_i)} \log p_\psi(\tilde{x}_i|z_i)$ with a uniform prior. In the second stage, we keep $q_\phi$ and $p_\psi$ fixed and learn the prior $p_\theta$ by minimizing the loss $D_{\mathrm{KL}}$. For simplicity, $q_\phi(\mathbf{z}|x_i)$ is defined as a one-point distribution over the most likely neural tokens $\hat{z}_i = \arg\max_z q_\phi(\mathbf{z}|x_i)$. Consequently, we can rewrite Equation 10 as

\[
\sum_{(x_i,\tilde{x}_i)\in\mathcal{D}} \log p(x_i \mid \tilde{x}_i) \;\geq\; \sum_{(x_i,\tilde{x}_i)\in\mathcal{D}} \Big( \underbrace{\mathbb{E}_{z_i \sim q_\phi(\mathbf{z} \mid x_i)}\big[\log p_\psi(\tilde{x}_i \mid z_i)\big]}_{\text{Neural Token Reconstruction}} + \underbrace{\log p_\theta(\hat{z}_i \mid \tilde{x}_i)}_{\text{Masked sEEG Modeling}} \Big), \tag{11}
\]

where the first term is the objective for vector-quantized neural signal regression in the first stage (i.e., the Du-IN VQ-VAE model), and the second term is the objective for Du-IN pre-training in the second stage (i.e., the Du-IN MAE model).

Appendix F Visualization of Vector-Quantized sEEG Regression

We further visualize how the sEEG signals are reconstructed. As depicted in Figure 7, although some details are missing, the overall trend of the signals is reconstructed well. Meanwhile, the reconstruction loss decreases stably during training, which indicates that the discrete codebook does learn high-level information from the sEEG signals.

[Figure 7: visualization of vector-quantized sEEG signal reconstruction.]

Appendix G Visualization of Mask sEEG Modeling

Figure 8 shows the convergence curves of the total pre-training loss and the masked sEEG modeling accuracy of the Du-IN MAE model. We observe a stable decrease in the mask modeling loss, and the mask modeling accuracy reaches about 20%, which is similar to [19].

[Figure 8: convergence curves of the Du-IN MAE pre-training loss and masked sEEG modeling accuracy.]

Appendix H Channel Contribution Analysis

For each subject, after training the Du-IN model (with random initialization) on the downstream dataset, we utilize the weights $W \in \mathbb{R}^{C \times D}$ of the linear projection in the spatial encoder to calculate the contribution scores $\mathcal{S}$ of the channels:

\[
\mathcal{S} = \{ s_i \mid i = 1, \dots, C \}, \qquad s_i = \frac{1}{D} \sum_{j=1}^{D} \lvert W_{ij} \rvert, \tag{12}
\]

where $C$ is the number of channels, $D$ is the dimension of the projected embedding, and $\lvert \cdot \rvert$ takes the absolute value. Then, we normalize $\mathcal{S}$ by its maximum value so that the scores fall within the $[0, 1]$ range. Finally, given the variability in model performance across subjects, we further scale the channel contribution scores by the decoding performance of each subject, i.e., $\mathcal{S} = \{ s_i \cdot p \mid i = 1, \dots, C \}$, where $p$ denotes the decoding performance of that subject.
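
A minimal NumPy sketch of this computation (Equation 12 plus the normalization and performance scaling described above):

```python
import numpy as np

def channel_contribution(W, decoding_acc):
    """Channel contribution scores from the spatial-encoder projection weights (Eq. 12).

    W: (C, D) weight matrix of the linear projection in the spatial encoder.
    decoding_acc: the subject's decoding performance, used to scale the scores.
    """
    s = np.abs(W).mean(axis=1)   # s_i = (1 / D) * sum_j |W_ij|
    s = s / s.max()              # normalize to [0, 1] by the maximum value
    return s * decoding_acc      # weight by the subject's decoding performance
```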

After calculating the channel contribution scores of all subjects, we project them onto the standard brain template according to the MNI (Montreal Neurological Institute) locations of the channels, using Nilearn 0.9.2. Since the electrodes are sparsely distributed within the brain, we use SciPy 1.8.1 to interpolate and smooth the channel contribution matrix, and use Nilearn to plot the channel contribution map shown in Figure 4(a).

With the sorted channels within each subject, we evaluate the effect of the number of channels on decoding performance. For each subject, we evaluate the Du-IN model with {5, 10, 15, 20, 30, 60} channels (sorted by channel contribution scores), and the average performance (across subjects) is shown in Figure 4(b).

Appendix I Subject-Wise Evaluation

The detailed performance of the different methods for each subject is provided in Table 8, Table 9, and Table 10. For model comparison, we report the average and standard deviation (within each subject) over six different random seeds to obtain comparable results. "Std" means standard deviation.

Table 8: Accuracy (%) ± Std (%), subj-01 to subj-04.
Methods | Pre-trained | subj-01 | subj-02 | subj-03 | subj-04
EEG-Conformer [34] | - | 58.41 ± 1.03 | 69.82 ± 1.22 | 19.50 ± 1.71 | 49.65 ± 2.38
CNN-BiGRU [27] | - | 46.46 ± 4.03 | 68.06 ± 1.56 | 4.35 ± 0.44 | 17.68 ± 3.88
BrainBERT [38] | ✓ | 6.76 ± 0.64 | 25.64 ± 1.23 | 2.97 ± 0.66 | 5.09 ± 0.44
Brant [41] | ✓ | 7.47 ± 2.83 | 54.26 ± 1.63 | 3.34 ± 0.38 | 4.15 ± 1.35
Du-IN | - | 71.25 ± 1.44 | 77.99 ± 0.87 | 23.04 ± 4.76 | 59.91 ± 4.58
Du-IN (vqvae+vq) | ✓ | 50.15 ± 3.80 | 62.79 ± 4.67 | 20.72 ± 2.15 | 48.24 ± 2.65
Du-IN (vqvae) | ✓ | 72.36 ± 1.55 | 79.16 ± 1.12 | 29.21 ± 2.38 | 63.83 ± 1.83
Du-IN (mae) | ✓ | 78.60 ± 0.79 | 83.61 ± 0.38 | 38.80 ± 2.52 | 70.98 ± 0.81

Table 9: Accuracy (%) ± Std (%), subj-05 to subj-08.
Methods | Pre-trained | subj-05 | subj-06 | subj-07 | subj-08
EEG-Conformer [34] | - | 65.44 ± 1.31 | 31.06 ± 2.58 | 47.89 ± 1.86 | 42.12 ± 2.08
CNN-BiGRU [27] | - | 51.26 ± 4.93 | 31.52 ± 1.48 | 47.75 ± 1.12 | 24.64 ± 4.44
BrainBERT [38] | ✓ | 11.28 ± 1.17 | 3.00 ± 0.47 | 5.31 ± 0.55 | 5.22 ± 0.76
Brant [41] | ✓ | 28.83 ± 1.93 | 5.28 ± 1.48 | 9.70 ± 1.85 | 8.93 ± 1.39
Du-IN | - | 77.60 ± 1.20 | 41.91 ± 1.80 | 59.63 ± 2.20 | 52.35 ± 2.18
Du-IN (vqvae+vq) | ✓ | 63.46 ± 2.28 | 34.84 ± 1.98 | 45.20 ± 2.44 | 40.14 ± 1.77
Du-IN (vqvae) | ✓ | 78.56 ± 1.24 | 43.29 ± 1.67 | 62.29 ± 1.49 | 54.10 ± 1.34
Du-IN (mae) | ✓ | 81.56 ± 1.11 | 46.90 ± 1.02 | 65.45 ± 1.74 | 59.09 ± 0.98

Table 10: Accuracy (%) ± Std (%), subj-09 to subj-12.
Methods | Pre-trained | subj-09 | subj-10 | subj-11 | subj-12
EEG-Conformer [34] | - | 56.51 ± 1.98 | 22.22 ± 1.07 | 57.10 ± 2.03 | 29.87 ± 1.44
CNN-BiGRU [27] | - | 44.03 ± 5.88 | 7.11 ± 0.71 | 28.44 ± 3.42 | 13.17 ± 3.41
BrainBERT [38] | ✓ | 7.20 ± 1.37 | 2.49 ± 0.43 | 10.60 ± 1.22 | 4.41 ± 0.88
Brant [41] | ✓ | 6.46 ± 1.66 | 3.00 ± 0.31 | 9.82 ± 1.71 | 7.82 ± 1.66
Du-IN | - | 66.39 ± 0.47 | 27.07 ± 2.24 | 73.56 ± 1.09 | 44.76 ± 3.74
Du-IN (vqvae+vq) | ✓ | 60.06 ± 1.61 | 22.05 ± 1.76 | 50.31 ± 4.69 | 32.06 ± 3.28
Du-IN (vqvae) | ✓ | 67.18 ± 1.22 | 31.06 ± 1.59 | 72.41 ± 1.98 | 45.38 ± 2.26
Du-IN (mae) | ✓ | 69.18 ± 1.96 | 34.23 ± 1.17 | 75.52 ± 1.27 | 48.54 ± 0.56

Appendix J Subject-Wise Electrode Locations

We provide detailed information on the locations of the implanted sEEG electrodes for each subject. Red channels are the top 10 channels (selected through channel contribution analysis) for both pre-training and downstream evaluation, as described in Section 4.3. As the majority of subjects have sEEG electrodes implanted on only one side of their brains to locate the source of epilepsy, we provide side views of either the left or right brain areas here.

[Figures: side views of the implanted sEEG electrode locations for each subject, with the top 10 contributing channels marked in red.]

NeurIPS Paper Checklist

  1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

Answer: [Yes]

Justification: Three of the four contributions mentioned at the end of the "Introduction" section are explicitly included. The contribution related to neuroscience-inspired analysis is summarized as "inspired by neuroscience findings" at the end of the Abstract.

  2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: We provide a separate "Limitations" section; see Section 5 for more details.

  3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [N/A]

Justification: The paper does not include theoretical results.

  4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

Answer: [Yes]

Justification: We provide detailed information related to our model and baselines in Appendix C and Appendix B, respectively. Besides, we provide the code and a demo dataset in Section 6.

  5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: We provide the code and a demo dataset in Section 6. Due to the lack of open-source sEEG datasets related to language, we collected a well-annotated Chinese word-reading sEEG dataset and evaluated our model on it. However, respecting the efforts of the data collectors, we only provide the data of some subjects to reproduce the experimental results in the main text. The whole dataset will be made publicly available to ensure the reproducibility of this work.

  6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: We provide detailed information related to our model and baselines in Appendix C and Appendix B, respectively.

  7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: For the main results, we report the average and standard error (across all subjects) over six random seeds. For the detailed subject-wise evaluation, we report the average and standard deviation (within each subject) over six random seeds.

  8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: Detailed information related to the training process is provided in Section 4.2.

  9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?

Answer: [Yes]

Justification: The research conducted in the paper conforms, in every respect, with the NeurIPS Code of Ethics.

  10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [No]

Justification: This work aims to explore the feasibility of decoding speech from intracranial neural signals, which mainly has positive impacts on society.

  11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [No]

Justification: The paper poses no such risks.

  12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: The creators or original owners of the assets (e.g., code, data, models) used in the paper are properly credited, and the licenses and terms of use are explicitly mentioned and properly respected.

  13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [N/A]

Justification: The paper does not release new assets.

  14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [Yes]

Justification: The ethics statements are provided in Section 6.

  15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [Yes]

Justification: The ethics statements are provided in Section 6.
