Du-IN: Discrete units-guided mask modeling for decoding speech from Intracranial Neural signals (2024)

Hui Zheng*,2,4, Hai-Teng Wang*,1, Wei-Bang Jiang3, Zhong-Tao Chen1,
Li He1, Pei-Yang Lin1, Peng-Hu Wei5, Guo-Guang Zhao5, Yun-Zhe Liu†,1,4
1Beijing Normal University, 2Peking University, 3Shanghai Jiao Tong University,
4Chinese Institute for Brain Research, 5Capital Medical University, Xuanwu Hospital, Beijing
*Equal contribution, †yunzhe.liu@bnu.edu.cn

Abstract

Invasive brain-computer interfaces have garnered significant attention due to their high performance. Current intracranial stereoElectroEncephaloGraphy (sEEG) foundation models typically build univariate representations based on a single channel, and some further use Transformers to model the relationships among channels. However, due to the locality and specificity of brain computation, their performance on more difficult tasks, e.g., speech decoding, which demands intricate processing in specific brain regions, has yet to be fully investigated. We hypothesize that building multi-variate representations within certain brain regions can better capture this specific neural processing. To explore this hypothesis, we collect a well-annotated Chinese word-reading sEEG dataset targeting language-related brain networks across 12 subjects. Leveraging this benchmark dataset, we develop the Du-IN model (Du-IN refers to the phonetic transcription of "讀音", i.e., pronunciation, in Chinese), which extracts contextual embeddings from specific brain regions through discrete codebook-guided mask modeling. Our model achieves SOTA performance on the downstream 61-word classification task, surpassing all baseline models. Model comparison and ablation analysis reveal that our design choices, including (i) multi-variate representation by fusing channels in the vSMC and STG regions and (ii) self-supervision by discrete codebook-guided mask modeling, contribute significantly to this performance. Collectively, our approach, inspired by neuroscience findings and capitalizing on multi-variate neural representations from specific brain regions, is well suited for invasive brain modeling. It marks a promising neuro-inspired AI approach in BCI.

[Figure 1]

1 Introduction

Brain signals refer to the biometric information collected from the brain. Their patterns provide valuable insights toward understanding the physiological functions of the brain and the mechanisms of related diseases, leading to various applications, including speech decoding cho2023neural ; duan2023dewave ; moses2021neuroprosthesis , sleep cognition research liu2022bstt ; zheng2023universal , neurological disorder detection jiang2024large ; zhang2024brant , and so on. Due to their high signal-to-noise ratio, invasive recording methods (e.g., stereoElectroEncephaloGraphy (sEEG), ElectroCorticoGraphy (ECoG)) usually reveal these underlying mechanisms better than non-invasive recording methods. Compared with ECoG, sEEG imposes less trauma on patients and provides more stereotactic information from specific brain regions. Although some researchers moses2021neuroprosthesis ; metzger2023high have recently built high-performance speech decoders based on ECoG, few attempts have been made to build speech decoders based on sEEG.

Modeling intracranial neural signals, especially sEEG, has drawn much research attention, but several issues remain unresolved. Current research on modeling intracranial neural signals is predominantly divided into two lines according to the basic modeling unit (i.e., a single channel or a group of channels). Some studies wang2023brainbert ; zhang2024brant embed a single channel to build univariate representations, focusing primarily on downstream tasks that involve single-channel prediction, e.g., epilepsy detection. However, they have not validated their methods on more challenging tasks that require integrating multiple channels for prediction, e.g., speech decoding. Other studies angrick2021real ; feng2023high fuse a group of channels to build multi-variate representations, but mostly with fully supervised methods that heavily rely on labeled data. Nonetheless, labeling data at scale in medical experiments is often impractical or costly, emphasizing the need to maximize label efficiency. To overcome these limitations, self-supervised pre-training followed by fine-tuning can leverage abundant unlabeled data, enhancing performance on various downstream tasks.

The primary challenge in modeling intracranial neural signals lies in extracting meaningful tokens, which requires careful consideration of two key factors. (1) Temporal scale. Since intracranial neural signals have high temporal resolution and a high signal-to-noise ratio, these tokens must capture rapid dynamic changes in brain activity. (2) Spatial scale. Since the functions of different brain regions are relatively specific, these tokens should correctly capture the information of each brain region for further integration. To better assess how well different models capture the intricate processing within each brain region, we can evaluate these methods on tasks that mainly involve a few brain regions.

Since speech mainly involves specific brain regions related to vocal production, as demonstrated in Section 2.1, we utilize speech decoding tasks to evaluate which model can effectively extract information from specific brain regions. Due to the lack of an open-source sEEG language dataset, we collected a well-annotated Chinese word-reading sEEG dataset (vocal production) including 12 subjects, addressing the absence of sEEG recordings for language tasks. Inspired by neuroscientific findings, we systematically demonstrate the locality and specificity of brain computation and propose the Du-IN model to solve the aforementioned issues. Compared to other existing methods for modeling brain signals, Du-IN achieves SOTA performance on the 61-word classification task, demonstrating the effectiveness of our model in extracting meaningful tokens that capture both the rapid changes and the precise state of specific brain regions. It marks a promising neuro-inspired AI approach saxe2021if ; richards2019deep in BCI.

To sum up, the main contributions of our work comprise:

  1. A well-annotated Chinese word-reading sEEG dataset, addressing the lack of sEEG language datasets. The dataset will be publicly available.

  2. Demonstration of brain-specific computation: achieving the best decoding performance requires only about one electrode in specific brain regions (i.e., vSMC, STG).

  3. A novel framework for sEEG speech decoding, Du-IN, which learns multi-variate contextual embeddings through discrete codebook-guided mask modeling.

  4. SOTA performance on the sEEG speech decoding task: Du-IN achieves 62.70% top-1 accuracy on the 61-word classification task, surpassing all other baselines.

2 Related Works

2.1 Neural Basis of Language Function

Past neuroscientific research bouchard2013functional ; dichter2018control ; sheng2019cortical has extensively explored the brain regions supporting language functionality. In neuroscience, the investigation into speech-related language functionality is divided into two main streams: one dedicated to semantic processing and the other to vocal production. Previous studies binder1997human ; sheng2019cortical have shown that brain regions associated with semantic processing primarily include the left inferior frontal gyrus (IFG), the left anterior temporal lobe (ATL), and the bilateral middle temporal gyrus (MTG).

Vocal production, which is the focus of our work, is predominantly governed by motor information related to language articulation, primarily involving the ventral sensorimotor cortex (vSMC), the bilateral superior temporal gyrus (STG), and the bilateral dorsal laryngeal motor cortex (dLMC) bouchard2013functional ; dichter2018control ; chartier2018encoding . Our analysis of the collected word-reading sEEG dataset also confirms this point, as illustrated in Figure 4.

2.2 Language Decoding in BCI

The keys to decoding natural language from brain signals are (1) high-quality recordings and (2) well-designed models with good representations. Compared to non-invasive recordings (e.g., EEG), invasive recordings provide detailed information about specific brain regions with a high signal-to-noise ratio. Since speech mainly involves specific brain regions, obtaining detailed recordings of these regions significantly enhances decoding performance. Existing works cho2023neural ; moses2021neuroprosthesis ; feng2023high have shown the great potential of building high-performance decoders based on invasive recordings.

The other key is well-designed models with good representations. Existing work on brain-to-language representations can be classified into two categories: self-supervision, or alignment with representation models pre-trained on other modalities (e.g., text, audio). BrainBERT wang2023brainbert learns general embeddings through self-supervised mask modeling. DeWave duan2023dewave introduces discrete codex encoding and aligns neural representations with text embeddings from BART lewis2019bart , thus enhancing the extraction of semantic processing-related information from EEG recordings. Metzger et al. metzger2023high align neural representations with acoustic embeddings to improve the extraction of vocal production-related information from ECoG recordings.

2.3 Self-supervised Learning in BCI

In recent years, self-supervised pre-training has made significant progress in natural language processing devlin2018bert ; radford2018improving ; brown2020language and computer vision bao2021beit ; he2022masked ; chen2020generative . However, its potential in BCI is far from fully explored. BrainBERT (for sEEG) wang2023brainbert builds univariate representations based on a single channel and utilizes mask modeling to learn general representations. Brant (for sEEG) zhang2024brant and LaBraM (for EEG) jiang2024large also build univariate representations but further consider the spatial correlation among channels. BENDR (for EEG) kostas2021bendr takes the other route, building multi-variate representations by fusing all channels, and uses contrastive learning to learn contextual representations. MMM (for EEG) yi2024learning also builds multi-variate representations but further considers the differences among brain regions (i.e., splitting channels into different groups).

All existing sEEG pre-training methods primarily construct univariate representations, and some further employ Transformer models to capture relationships among channels. However, unlike EEG pre-training methods, their effectiveness relative to multi-variate representations has not been experimentally established. Besides, unlike EEG, there is no standard channel configuration for sEEG recordings, which makes modeling spatial relationships in sEEG more challenging.

3 Method

The overall architecture of Du-IN is illustrated in Figure 2, where the raw sEEG signals are fused across channels to build multi-variate representations and further encoded for downstream tasks.

3.1 Task Definition

Due to the lack of open-source sEEG datasets related to language tasks, we follow the experimental design outlined by Moses et al. moses2021neuroprosthesis to collect a well-annotated Chinese word-reading sEEG dataset (vocal production). During the experiment, each subject speaks aloud 61 pre-determined Chinese words 50 times; see Appendix A for more details. We formulate the multi-channel sEEG signals as $\mathcal{X} \in \mathbb{R}^{C \times T}$, where $C$ is the number of sEEG channels and $T$ is the total number of timestamps. The associated word label is denoted as $\bm{y} \in \mathcal{Y}$, where $\mathcal{Y}$ represents the set of 61 pre-determined words. In summary, this dataset comprises paired sEEG-word data $\langle\mathcal{X}, \bm{y}\rangle$, and the model aims to decode the corresponding word $\bm{y}$ from a sequence of raw sEEG signals $\mathcal{X}$.
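To make the data interface concrete, the following is a minimal sketch of a paired sEEG-word dataset wrapper in PyTorch; the array layout and class name are illustrative assumptions, not the released dataset's interface.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class SEEGWordDataset(Dataset):
    """Paired (sEEG segment, word label) samples; layout is illustrative."""

    def __init__(self, seeg: np.ndarray, labels: np.ndarray):
        # seeg: (n_samples, C, T) float array of preprocessed sEEG segments
        # labels: (n_samples,) integer word indices in [0, 61)
        assert seeg.shape[0] == labels.shape[0]
        self.seeg = torch.as_tensor(seeg, dtype=torch.float32)
        self.labels = torch.as_tensor(labels, dtype=torch.long)

    def __len__(self) -> int:
        return self.seeg.shape[0]

    def __getitem__(self, idx):
        # X: (C, T) multi-channel sEEG; y: scalar word index
        return self.seeg[idx], self.labels[idx]
```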

3.2 Model Architecture

[Figure 2]

We introduce the neural Transformer, a general architecture for sEEG speech decoding that can handle input sEEG signals of arbitrary time length, as shown in Figure 2. The key operation for achieving this is segmenting the sEEG signals into patches, inspired by patch embeddings in images dosovitskiy2020image . For each sample $\mathcal{X}$, we use a $W$-length window without overlap to segment it into patches, obtaining $\mathcal{X} = \{\bm{x}_i \in \mathbb{R}^{C \times W} \mid i = 1, \dots, N\}$, where $N = \lfloor T / W \rfloor$ is the number of patches.
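The patching step can be illustrated with a short sketch that splits a $(C, T)$ sample into $N = \lfloor T/W \rfloor$ non-overlapping windows; the function name is ours.

```python
import torch

def segment_into_patches(x: torch.Tensor, window: int) -> torch.Tensor:
    """Split a (C, T) sEEG sample into non-overlapping patches.

    Returns a tensor of shape (N, C, W) with N = T // window;
    any trailing timestamps that do not fill a window are dropped.
    """
    C, T = x.shape
    n_patches = T // window
    x = x[:, : n_patches * window]              # drop the incomplete tail
    patches = x.reshape(C, n_patches, window)   # (C, N, W)
    return patches.permute(1, 0, 2)             # (N, C, W)
```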

Spatial Encoder.

As each sEEG patch has multiple channels, it is vital to fuse different channels to extract meaningful features before patch-wise interaction by self-attention. We employ a spatial encoder, which consists of a linear projection and several convolution blocks, to encode each sEEG patch into a patch embedding. The linear projection transforms the raw sEEG signals into the hidden neural space, and its weights are utilized for subsequent analysis. The convolution block is composed of a 1-D convolution layer and a batch normalization layer ioffe2015batch . We denote the output patch embeddings from the spatial encoder as

\mathcal{E}_p = \{\bm{e}^p_i \in \mathbb{R}^d \mid i = 1, \dots, N\}, \qquad (1)

where $d$ is the dimension of the embeddings.
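A sketch of a spatial encoder with this structure (channel-fusing linear projection followed by 1-D convolution blocks with batch normalization); the kernel sizes, strides, activation, and final pooling step are assumptions not specified above.

```python
import torch
import torch.nn as nn

class SpatialEncoder(nn.Module):
    """Fuse channels within each patch into one d-dim patch embedding.

    Input:  (B, N, C, W) patches; output: (B, N, d) patch embeddings.
    Kernel sizes, strides, activation, and pooling are illustrative assumptions.
    """

    def __init__(self, n_channels: int, d_model: int = 160):
        super().__init__()
        # Channel fusion: a linear projection applied at every timestamp.
        self.proj = nn.Linear(n_channels, d_model)

        # Three 1-D convolution blocks (conv + batch norm) over time.
        def block(stride: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv1d(d_model, d_model, kernel_size=3, stride=stride, padding=1),
                nn.BatchNorm1d(d_model),
                nn.GELU(),
            )
        self.convs = nn.Sequential(block(2), block(2), block(2))
        self.pool = nn.AdaptiveAvgPool1d(1)  # collapse remaining time steps

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        B, N, C, W = patches.shape
        x = patches.reshape(B * N, C, W).transpose(1, 2)  # (B*N, W, C)
        x = self.proj(x).transpose(1, 2)                  # (B*N, d, W)
        x = self.convs(x)                                 # (B*N, d, W')
        x = self.pool(x).squeeze(-1)                      # (B*N, d)
        return x.reshape(B, N, -1)                        # (B, N, d)
```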

Temporal Embedding.

To make the model aware of the temporal information of the patch embeddings, we utilize the parameter-free position embeddings introduced in vaswani2017attention , i.e., $\mathcal{E}_t = \{\bm{e}^t_1, \dots, \bm{e}^t_{t_{max}}\}$. Note that $t_{max}$ is a hyperparameter determining the maximum number of time patches, with $t_{max} \geq N$. Given an arbitrary patch embedding $\bm{e}^p_i$ from the spatial encoder in Equation 1, we add the corresponding temporal embedding to it:

\mathcal{E}_{init} = \{\bm{e}^p_i + \bm{e}^t_i \mid i = 1, \dots, N\}, \qquad (2)

which forms the input embeddings $\mathcal{E}_{init}$ for the Transformer encoder.
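The parameter-free sinusoidal position embeddings of vaswani2017attention can be precomputed once and added to the patch embeddings; a standard implementation is sketched below.

```python
import math
import torch

def sinusoidal_embeddings(t_max: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal position embeddings, shape (t_max, d_model)."""
    position = torch.arange(t_max, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )
    emb = torch.zeros(t_max, d_model)
    emb[:, 0::2] = torch.sin(position * div_term)
    emb[:, 1::2] = torch.cos(position * div_term)
    return emb

# Usage: add the first N embeddings to the patch embeddings (B, N, d), e.g.
#   e_init = e_patch + sinusoidal_embeddings(t_max, d)[:N]
```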

Transformer Encoder.

Finally, the sequence of embeddings is fed into the Transformer encoder vaswani2017attention to obtain the final encoded embeddings $\mathcal{E} = \{\bm{e}_i \in \mathbb{R}^d \mid i = 1, \dots, N\}$. To make the training of the Transformer more stable and efficient, we incorporate some modifications dehghani2023scaling . We add layer normalization to the queries and keys before the dot-product attention mechanism, which avoids over-large values in the attention logits:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{\mathrm{LN}(Q)\,\mathrm{LN}(K)^{T}}{\sqrt{d_{head}}}\right)V, \qquad (3)

where $d_{head}$ is the dimension of each attention head and $\mathrm{LN}$ denotes layer normalization ba2016layer . For downstream classification tasks, we flatten the output embeddings and feed them into a classification head.
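A minimal sketch of the attention variant in Equation 3, with layer normalization applied to queries and keys before the dot product; tensor shapes are illustrative.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Scaled dot-product attention with LayerNorm on queries and keys (Eq. 3)."""

    def __init__(self, d_head: int):
        super().__init__()
        self.d_head = d_head
        self.ln_q = nn.LayerNorm(d_head)
        self.ln_k = nn.LayerNorm(d_head)

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # q, k, v: (batch, n_heads, n_patches, d_head)
        q, k = self.ln_q(q), self.ln_k(k)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        return F.softmax(scores, dim=-1) @ v
```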

3.3 Neural Tokenizer Training

Prior to pre-training Du-IN through mask modeling, we need to tokenize the sEEG patches into discrete tokens. We introduce vector-quantized neural signal regression, which is trained by reconstructing the original sEEG signals, as shown in Figure 3. The key components are the neural tokenizer, which encodes the raw sEEG samples into embeddings, and the neural regressor, which reconstructs the original sEEG signals. The idea is inspired by VQ-VAE van2017neural , which encodes images into discrete latent embeddings.

[Figure 3]
Neural Tokenizer.

We define a neural codebook $\mathcal{C} = \{\bm{c}_i \mid i = 1, \dots, N_{codex}\} \in \mathbb{R}^{N_{codex} \times d_{codex}}$, where $N_{codex}$ is the number of discrete neural embeddings and $d_{codex}$ is the dimension of each embedding. Given a sEEG sample $\mathcal{X}$, the neural tokenizer (i.e., the neural Transformer illustrated in Figure 2) first encodes it into embeddings $\mathcal{E} = \{\bm{e}_i \in \mathbb{R}^d \mid i = 1, \dots, N\}$. After that, we utilize a linear projection $\bm{\mathrm{z}}_c$ to get the mapped embeddings $\bm{\mathrm{z}}_c(\mathcal{E}) = \{\bm{\mathrm{z}}_c(\bm{e}_i) \in \mathbb{R}^{d_{codex}} \mid i = 1, \dots, N\}$ in the codebook space. Then, the codebook looks up the nearest neighbor of each embedding $\bm{\mathrm{z}}_c(\bm{e}_i)$ in the neural codebook $\mathcal{C}$. This procedure can be formulated as

\bm{\mathrm{z}}_q(\mathcal{E}) = \{\bm{\mathrm{z}}_q(\bm{e}_i) \mid i = 1, \dots, N\}, \quad \bm{\mathrm{z}}_q(\bm{e}_i) = \bm{c}_{z_i}, \quad z_i = \mathop{\arg\min}\limits_{j} \|\ell_2(\bm{\mathrm{z}}_c(\bm{e}_i)) - \ell_2(\bm{c}_j)\|_2, \qquad (4)

where $\ell_2$ represents $\ell_2$ normalization and $\bm{\mathrm{z}}_q(\bm{e}_i)$ is the quantized vector after the quantizer. This is equivalent to finding the closest neural embedding by cosine similarity, and such $\ell_2$ normalization improves codebook utilization peng2022beitv2 .
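The nearest-neighbor lookup in Equation 4 can be sketched as follows; since both the embeddings and the codebook entries are $\ell_2$-normalized, the argmin over Euclidean distances reduces to an argmax over cosine similarities. Variable names are ours.

```python
import torch
import torch.nn.functional as F

def quantize(z_c: torch.Tensor, codebook: torch.Tensor):
    """Map projected embeddings to their nearest codebook entries (Eq. 4).

    z_c:      (N, d_codex) projected patch embeddings
    codebook: (N_codex, d_codex) learnable codebook
    Returns the quantized vectors (N, d_codex) and their indices (N,).
    """
    z_n = F.normalize(z_c, dim=-1)        # l2-normalize embeddings
    c_n = F.normalize(codebook, dim=-1)   # l2-normalize codebook entries
    # For unit vectors, minimizing Euclidean distance is equivalent to
    # maximizing cosine similarity, so an argmax over z_n @ c_n^T suffices.
    indices = (z_n @ c_n.t()).argmax(dim=-1)   # (N,)
    z_q = codebook[indices]                    # nearest codebook vectors
    return z_q, indices
```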

Neural Regressor.

The neural regressor consists of a Transformer decoder and a stack of transposed convolution layers. Given a sequence of vector-quantized embeddings $\mathcal{Z} = \{\bm{z}_i \mid i = 1, \dots, N\}$, the neural regressor converts these discrete embeddings back into raw sEEG signals $\tilde{\mathcal{X}} = \{\tilde{\bm{x}}_i \mid i = 1, \dots, N\}$. The mean squared error (MSE) loss is utilized to guide the regression. The total loss for training the neural tokenizer (i.e., the Du-IN VQ-VAE model) is defined as:

\mathcal{L}_{vqvae} = \sum_{i=1}^{N} \Big[ \|\tilde{\bm{x}}_i - \bm{x}_i\|_2^2 + \|\bm{\mathrm{sg}}[\bm{\mathrm{z}}_c(\bm{e}_i)] - \bm{\mathrm{z}}_q(\bm{e}_i)\|_2^2 + \beta \|\bm{\mathrm{z}}_c(\bm{e}_i) - \bm{\mathrm{sg}}[\bm{\mathrm{z}}_q(\bm{e}_i)]\|_2^2 \Big], \qquad (5)

where $\bm{\mathrm{sg}}$ represents the stop-gradient operation, which is the identity at the forward pass and has zero gradients. To stabilize the codebook update, we use the exponential moving average strategy van2017neural .
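A sketch of the objective in Equation 5, using detach() as the stop-gradient; the commitment weight beta=0.25 is the common VQ-VAE default and is an assumption here.

```python
import torch
import torch.nn.functional as F

def vqvae_loss(x_rec: torch.Tensor, x: torch.Tensor,
               z_c: torch.Tensor, z_q: torch.Tensor,
               beta: float = 0.25) -> torch.Tensor:
    """Reconstruction + codebook + commitment terms of Eq. 5.

    x_rec, x: reconstructed and original sEEG patches
    z_c, z_q: projected embeddings and their quantized counterparts
    """
    recon = F.mse_loss(x_rec, x)                 # ||x~ - x||^2
    codebook = F.mse_loss(z_q, z_c.detach())     # ||sg[z_c] - z_q||^2
    commit = F.mse_loss(z_c, z_q.detach())       # ||z_c - sg[z_q]||^2
    return recon + codebook + beta * commit

# In practice the decoder receives the quantized vectors through a
# straight-through estimator, e.g. z_st = z_c + (z_q - z_c).detach(),
# and with EMA codebook updates the middle term is typically dropped.
```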

3.4 Pre-training Du-IN

Masked sEEG Modeling.

To encourage Du-IN to learn contextual representations, we propose masked sEEG modeling. The whole procedure is presented in Figure 3. As illustrated in Figure 2, given a sEEG sample $\mathcal{X}$, the spatial encoder first transforms it into patch embeddings $\mathcal{E}_p = \{\bm{e}^p_i \mid i = 1, \dots, N\}$. Around 50% of these patch embeddings are patch-wisely chosen and masked; the set of masked positions is denoted as $\mathcal{M}$. Then, a shared learnable embedding $\bm{e}_{[M]} \in \mathbb{R}^d$ is used to replace the original patch embeddings:

\mathcal{E}_m = \{\bm{e}^m_i \mid i = 1, \dots, N\}, \quad \bm{e}^m_i = \delta(i \in \mathcal{M}) \odot \bm{e}_{[M]} + (1 - \delta(i \in \mathcal{M})) \odot \bm{e}^p_i, \qquad (6)

where $\delta(\cdot)$ is the indicator function. After that, the masked embeddings $\mathcal{E}_m$ are summed with the temporal embeddings and then fed into the Transformer encoder. The output embeddings $\mathcal{E}$ are used to predict the corresponding neural tokens through a linear classifier:

p(z_i \mid \bm{e}_i) = \mathrm{softmax}(\mathrm{Linear}(\bm{e}_i)). \qquad (7)

The training loss of mask modeling is defined as:

\mathcal{L}_{\mathcal{M}} = -\sum_{i \in \mathcal{M}} m_i \odot \log p(z_i \mid \bm{e}_i). \qquad (8)
Symmetric Masking.

Inspired by LaBraM jiang2024large , we further introduce a symmetric masking strategy to improve training efficiency. We calculate the inverse of the generated mask $\mathcal{M}$, obtaining $\hat{\mathcal{M}}$. Similarly, we use the new mask $\hat{\mathcal{M}}$ to perform mask modeling, obtaining the mask modeling loss $\mathcal{L}_{\mathcal{M}}^{sym}$. The total loss for pre-training the Du-IN model (i.e., the Du-IN MAE model) is defined as:

\mathcal{L}_{mae} = \mathcal{L}_{\mathcal{M}} + \mathcal{L}_{\mathcal{M}}^{sym}. \qquad (9)
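The masking step and the symmetric-mask objective (Equations 6-9) can be sketched as follows; the 50% masking ratio follows the text, while the function names and the assumption that temporal embeddings are added inside the encoder are ours.

```python
import torch
import torch.nn.functional as F

def masked_prediction_loss(patch_emb, mask, mask_token, encoder, head, targets):
    """One masked-modeling pass (Eqs. 6-8).

    patch_emb:  (B, N, d) patch embeddings from the spatial encoder
    mask:       (B, N) boolean, True where a patch is masked
    mask_token: (d,) shared learnable embedding e_[M]
    targets:    (B, N) discrete codex indices from the frozen neural tokenizer
    `encoder` is assumed to add the temporal embeddings internally.
    """
    e_m = torch.where(mask.unsqueeze(-1), mask_token, patch_emb)   # Eq. 6
    logits = head(encoder(e_m))                                    # (B, N, N_codex)
    return F.cross_entropy(logits[mask], targets[mask])            # Eq. 8, masked positions only

def du_in_mae_loss(patch_emb, mask_token, encoder, head, targets, ratio=0.5):
    """Total pre-training loss with the symmetric mask (Eq. 9)."""
    B, N, _ = patch_emb.shape
    mask = torch.rand(B, N, device=patch_emb.device) < ratio       # ~50% of patches masked
    loss = masked_prediction_loss(patch_emb, mask, mask_token, encoder, head, targets)
    loss_sym = masked_prediction_loss(patch_emb, ~mask, mask_token, encoder, head, targets)
    return loss + loss_sym
```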

4 Experiments

4.1 Dataset

Due to the lack of open-source sEEG datasets related to language tasks, we follow the experimental design outlined by Moses et al. moses2021neuroprosthesis to collect a well-annotated Chinese word-reading sEEG dataset (vocal production), including 12 subjects. Each subject underwent a surgical procedure to implant 7 to 13 invasive sEEG electrodes, yielding 72 to 158 channels per subject. For each subject, the dataset contains 15 hours of 2000 Hz recordings, 3 hours of which are task recordings.

Pre-training dataset.

For each subject, the pre-training dataset contains all sEEG recordings (about 54 million timestamps) of that subject. To stabilize computing resource usage, the time length of each sEEG sample $\mathcal{X}$ is set to 4 seconds.

Downstream dataset.

For each subject, 3 hours of the sEEG recordings are task recordings. The sEEG signals are segmented into about 3000 3-second samples, each of which is paired with the corresponding word label (from 61 pre-determined words).

4.2 Implementation Details

Preprocess.

We first band-pass filter the sEEG signals between 0.5 Hz and 200 Hz to remove low-frequency noise. Then, a 50 Hz notch filter is applied to avoid power-line interference. After that, all sEEG signals are resampled to 1000 Hz and bi-polar re-referenced li2018optimal . Finally, we perform z-score normalization on each channel to guarantee normalized data scales across all channels.
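A sketch of this preprocessing chain with SciPy; the filter orders and the adjacent-contact bipolar pairing are assumptions, and in practice such steps are often performed with dedicated toolboxes (e.g., MNE).

```python
import numpy as np
from scipy.signal import butter, iirnotch, filtfilt, resample_poly

def preprocess(seeg: np.ndarray, fs: int = 2000, fs_out: int = 1000) -> np.ndarray:
    """Band-pass, notch, resample, bipolar re-reference, and z-score.

    seeg: (C, T) raw recording at `fs` Hz; returns (C-1, T') at `fs_out` Hz.
    The simple adjacent-contact bipolar scheme here is an assumption.
    """
    # 0.5-200 Hz band-pass (4th-order Butterworth, zero-phase).
    b, a = butter(4, [0.5, 200.0], btype="bandpass", fs=fs)
    x = filtfilt(b, a, seeg, axis=-1)
    # 50 Hz notch to suppress power-line interference.
    b_n, a_n = iirnotch(w0=50.0, Q=30.0, fs=fs)
    x = filtfilt(b_n, a_n, x, axis=-1)
    # Resample 2000 Hz -> 1000 Hz.
    x = resample_poly(x, up=fs_out, down=fs, axis=-1)
    # Bipolar re-reference between adjacent contacts.
    x = x[:-1] - x[1:]
    # Per-channel z-score normalization.
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + 1e-8)
```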

Model Configurations.

The length of each sEEG patch is 100 ms, resulting in 40 patches per sample in the pre-training dataset and 30 patches per sample in the downstream dataset. The "Spatial Encoder" contains one linear projection and three 1-D convolution layers, transforming the original sEEG patches into patch embeddings with $d = 160$. The following "Transformer Encoder" contains an 8-layer Transformer encoder with model dimension $d = 160$, inner (FFN) dimension $d_{ff} = 320$, and 8 attention heads. See Appendix C for more details.

Pre-training.

During pre-training, we use either all sEEG recordings (15 hours) or the sEEG recordings without task recordings (12 hours) to train the Du-IN VQ-VAE and Du-IN MAE models. To enhance the robustness of the learned codexes and representations, we further use the data augmentation described in Appendix D. For each subject, the models are pre-trained on a Linux system with 2 CPUs (Intel Xeon Gold 6230 40-Core Processor) and 1 GPU (NVIDIA Tesla V100 32GB) for ~1.2 days.

Fine-tuning.

During the downstream evaluation, we split the task recordings into training, validation, and testing splits in an approximate 80%/10%/10% ratio. We also use the data augmentation described in Appendix D to make the most of the gathered dataset. We employ cross-entropy loss (multi-class classification) as the training loss. Our experiments are conducted on one V100 GPU with Python 3.11.7 and PyTorch 2.1.2 + CUDA 12.3. The best models are trained on the training set, selected on the validation set, and finally evaluated on the test set. For model comparison, we report the average and standard error values (across all subjects) over six different random seeds to obtain comparable results. For the subject-wise evaluation, we report the average and standard deviation values (of each subject) in Appendix I.
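A minimal fine-tuning loop consistent with this setup (cross-entropy over the 61 classes, model selection on the validation split); the optimizer, learning rate, and epoch count are assumptions.

```python
import torch
import torch.nn.functional as F

def finetune(model, train_loader, val_loader, epochs=50, lr=1e-4, device="cuda"):
    """Fine-tune a pre-trained encoder plus classification head on 61-word labels."""
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)  # optimizer/lr are assumptions
    best_acc, best_state = -1.0, None
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:              # x: (B, C, T), y: (B,)
            logits = model(x.to(device))       # (B, 61)
            loss = F.cross_entropy(logits, y.to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Model selection on the validation split.
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for x, y in val_loader:
                pred = model(x.to(device)).argmax(dim=-1).cpu()
                correct += (pred == y).sum().item()
                total += y.numel()
        if correct / total > best_acc:
            best_acc = correct / total
            best_state = {k: v.detach().clone() for k, v in model.state_dict().items()}
    model.load_state_dict(best_state)
    return model, best_acc
```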

4.3 Channel Contribution and Selection

As demonstrated in Section 2.1, previous neuroscience studies reveal that vocal production predominantly engages specific brain regions. Given the sparse distribution of implanted sEEG electrodes (each containing 8-16 channels), it is vital to exclude redundant electrodes unrelated to vocal production, thus improving decoding performance. We retain electrodes implanted in relevant brain regions and evaluate the performance based on the remaining electrodes. Table 1 shows that excluding approximately 85% of the electrodes leads to a dramatic increase in decoding performance.

Methods | # of Channels (Averaged) | Accuracy (%) ± Ste (%)
Du-IN (w/o electrode selection) | 109.75 | 30.12 ± 5.64
Du-IN (w/ electrode selection) | 12.25 | 55.92 ± 4.96

To further understand the detailed contribution of each channel, we analyze the weights of the linear projection in the spatial encoder. In detail, we calculate the contribution scores of channels per subject and organize them accordingly, as described in Appendix H. Figure 4 demonstrates that (1) the brain regions effective for speech decoding align with findings from previous neuroscience research, and (2) our model achieves optimal decoding performance with approximately 10 channels, 80% of which originate from the same electrode. For simplicity, we utilize these top 10 channels for both pre-training and downstream evaluation.
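One simple proxy for such contribution scores is the per-channel weight norm of the spatial encoder's linear projection, sketched below; the exact procedure is described in Appendix H, so this should be read only as an illustrative approximation.

```python
import torch

def channel_contribution_scores(proj_weight: torch.Tensor) -> torch.Tensor:
    """Score each input channel by the l2 norm of its projection weights.

    proj_weight: (d_model, n_channels) weight of the spatial encoder's
    linear projection. Returns normalized scores of shape (n_channels,).
    This proxy may differ from the procedure in Appendix H.
    """
    scores = proj_weight.norm(dim=0)   # l2 norm over output dimensions
    return scores / scores.sum()

# Usage: pick the top-10 channels for pre-training and downstream evaluation.
#   top10 = channel_contribution_scores(encoder.proj.weight).topk(10).indices
```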

[Figure 4]

4.4 Comparison with Other Models

Table 2 presents the results of our Du-IN model, previous sEEG self-supervised models, and other supervised baselines. The results demonstrate that our Du-IN model outperforms all baselines. Notably, the models that build univariate representations based on a single channel (i.e., BrainBERT, Brant) perform worse than the models building multi-variate representations, challenging the generalizability of current Transformer-based strategies for modeling spatial relationships among channels.

As the BrainBERT model does not consider the spatial relationships among channels, we mainly focus on understanding why Brant fails to effectively capture the spatial relationships in the speech decoding task. Compared to epilepsy detection, speech decoding is a more difficult task, demanding intricate processing in specific brain regions. However, unlike LaBraM jiang2024large , Brant does not introduce spatial embeddings to identify the spatial location of each channel. Since the electrodes are sparsely distributed in the brain and the raw sEEG signals on the same electrode are highly correlated, it is fairly easy to identify their spatial relationships from their values. As demonstrated in iTransformer liu2023itransformer , this modeling approach is well suited for detecting time-delay events, e.g., epilepsy detection.

For speech decoding tasks, sEEG often requires bi-polar re-referencing (or Laplacian re-referencing) to remove the high correlations among channels, thus avoiding model overfitting wang2023brainbert . Once the correlations among channels have been removed, Brant loses the ability to model spatial relationships among channels. Meanwhile, Brant only involves a few sEEG location configurations (i.e., 10 subjects) during its pre-training stage; with their method, more sEEG location configurations would likely be required to learn generic relationships among channels.

Methods | Pre-trained | Model Size | Accuracy (%) ± Ste (%)
EEG-Conformer song2022eeg | ✗ | 2.34M | 45.82 ± 4.66
CNN-BiGRU moses2021neuroprosthesis | ✗ | 0.54M | 32.04 ± 5.45
BrainBERT wang2023brainbert | ✓ | 43.58M | 7.50 ± 1.76
Brant zhang2024brant | ✓ | 69.35M | 12.42 ± 4.10
Du-IN | ✗ | 4.38M | 56.29 ± 5.20
Du-IN (vqvae+vq) | ✓ | 4.38M | 44.17 ± 4.04
Du-IN (vqvae) | ✓ | 4.38M | 58.24 ± 4.83
Du-IN (mae) | ✓ | 4.38M | 62.70 ± 4.69

4.5 Ablation Study

Self-Supervision Initialization.

As illustrated in Figure 3, the Du-IN model entails a two-stage pre-training process, wherein both the Du-IN VQ-VAE model and the Du-IN MAE model are trained. Previous studies utilize different strategies duan2023dewave ; chen2024eegformer ; jiang2024large to leverage these pre-trained models to enhance the performance of downstream tasks. Here, we evaluate these different strategies for comparison; see Appendix C.3 for detailed definitions. Table 2 shows that initializing weights from the Du-IN MAE model captures contextual embeddings effectively, resulting in the highest decoding performance.

Pre-training with/without Downstream Datasets.

During the pre-training stage, we hope that the Du-IN VQ-VAE model can extract general tokens of that brain region, thus guiding the Du-IN MAE model to learn general representations that are not specific to any particular task. Although no labeled data is used during the pre-training stage, to eliminate the influence of the pre-training data on downstream tasks, we compare the results with and without incorporating the downstream task dataset into the pre-training stage. Table 3 shows a slight performance drop when excluding the downstream dataset. However, the decoding performance is still higher than the baseline performance without pre-training, suggesting that the degradation is mainly due to the reduced size of the pre-training dataset. We expect that, with more task-free recordings for pre-training, our model can achieve better decoding performance.

Methods | Pre-training Dataset Size | Accuracy (%) ± Ste (%)
Du-IN (mae w/o downstream dataset) | 12 hours per subject | 60.02 ± 4.34
Du-IN (mae w/ downstream dataset) | 15 hours per subject | 62.70 ± 4.69
Discrete Codex.

During the neural tokenizer training stage, the Du-IN VQ-VAE model encodes sEEG patches into discrete codexes and then reconstructs the original signal from these codexes. We evaluate performance across varying codex sizes (512 to 8192) to ascertain whether codex size affects the quality of the learned codebook. As illustrated in Figure 5, while an extremely small codex size lacks representation diversity, an extremely large codex size often leads to codebook collapse. We suspect that our existing training data might not be adequate for larger codex sizes. Furthermore, our experiments suggest that the model performs optimally when the codex dimension, $d_{codex} = 64$, is slightly smaller than the model dimension, $d = 160$, yielding a more effective regularization effect.

Perception Time Window.

We also conduct an ablation study on the structure of the spatial encoder described in Section 3.2. As the spatial encoder transforms the sEEG signals within a given patch into a patch embedding, it compresses the sEEG signals for perception. As described in Section 4.2, the model utilizes a perception field of 100 ms. We conduct an ablation study over different perception fields and report it in Figure 5. The model performance notably drops with a perception field smaller than 60 ms and gradually declines as the perception field exceeds 160 ms, with a small peak around 100 ms to 140 ms. We consider this phenomenon reasonable, since sEEG is known for its ability to precisely capture the rapid dynamics of specific brain regions.

[Figure 5]

5 Limitations

Despite Du-IN's enhancements in speech decoding via discrete codebook-guided mask modeling, it is still restricted to closed-set speech decoding tasks (i.e., the word set only includes 61 pre-determined words). However, in work parallel to ours, Feng et al. feng2023high build an acoustic-inspired model that can decode arbitrary Chinese words by predicting syllable components (initials, finals, tones). Although their method requires a large amount of labeled data, their experimental design closely mirrors ours; the difference lies in requiring the subject to repeat syllable components instead of entire words. Therefore, with slight modifications, our model can support open-set speech decoding tasks.

Additionally, the experiments in this paper are restricted to the vocal production part of language decoding, i.e., speech decoding. A more interesting but difficult task is to decode language at the semantic level, where large language models have been widely used to improve model performance tang2023semantic ; duan2023dewave . However, due to the locality of sEEG recordings, it remains under exploration whether sEEG recordings can fully capture semantic-related information across brain regions.

6 Conclusion

This paper proposes Du-IN, a framework for speech decoding, which learns contextual embeddings through discrete codebook-guided mask modeling on specific brain regions. To evaluate our model, we collect a well-annotated Chinese word-reading sEEG dataset, addressing the lack of sEEG language datasets. Inspired by neuroscientific findings, we analyze the brain regions effective for speech decoding and achieve the best decoding performance with about one electrode in specific brain regions, which dovetails with past neuroscientific research on language. Comprehensive experiments demonstrate that our model outperforms both supervised and sEEG-based self-supervised baselines, effectively capturing the intricate processing within specific brain regions. It marks a promising neuro-inspired AI approach in BCI. Finally, we hope our work can inform future developments in sEEG-based self-supervised models, with more consideration of how to build the basic representation units so that the model can maximally benefit from the pre-training stage.

Ethics Statement

Experiments that contribute to this work were approved by the IRB. All subjects consented to participate. All electrode locations were exclusively dictated by clinical considerations.

Reproducibility Statement

Code to train models and reproduce the results is submitted as part of the supplementary materials and can be accessed here: TODO, including a demo dataset of 3 subjects for downstream fine-tuning.

References

  • (1) Miguel Angrick, Maarten C. Ottenhoff, Lorenz Diener, Darius Ivucic, Gabriel Ivucic, Sophocles Goulis, Jeremy Saal, Albert J. Colon, Louis Wagner, Dean J. Krusienski, et al. Real-time synthesis of imagined speech processes from minimally invasive recordings of neural activity. Communications Biology, 4(1):1055, 2021.
  • (2) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • (3) Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021.
  • (4) Jeffrey R. Binder, Julie A. Frost, Thomas A. Hammeke, Robert W. Cox, Stephen M. Rao, and Thomas Prieto. Human brain language areas identified by functional magnetic resonance imaging. Journal of Neuroscience, 17(1):353–362, 1997.
  • (5) Kristofer E. Bouchard, Nima Mesgarani, Keith Johnson, and Edward F. Chang. Functional organization of human sensorimotor cortex for speech articulation. Nature, 495(7441):327–332, 2013.
  • (6) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
  • (7) Josh Chartier, Gopala K. Anumanchipalli, Keith Johnson, and Edward F. Chang. Encoding of articulatory kinematic trajectories in human speech sensorimotor cortex. Neuron, 98(5):1042–1054, 2018.
  • (8) Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, pages 1691–1703. PMLR, 2020.
  • (9) Yuqi Chen, Kan Ren, Kaitao Song, Yansen Wang, Yifan Wang, Dongsheng Li, and Lili Qiu. EEGFormer: Towards transferable and interpretable large-scale EEG foundation model. arXiv preprint arXiv:2401.10278, 2024.
  • (10) Cheol Jun Cho, Edward Chang, and Gopala Anumanchipalli. Neural latent aligner: Cross-trial alignment for learning representations of complex, naturalistic neural data. In International Conference on Machine Learning, pages 5661–5676. PMLR, 2023.
  • (11) Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pages 7480–7512. PMLR, 2023.
  • (12) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • (13) Benjamin K. Dichter, Jonathan D. Breshears, Matthew K. Leonard, and Edward F. Chang. The control of vocal pitch in human laryngeal motor cortex. Cell, 174(1):21–31, 2018.
  • (14) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • (15) Yiqun Duan, Jinzhao Zhou, Zhen Wang, Yu-Kai Wang, and Chin-Teng Lin. DeWave: Discrete EEG waves encoding for brain dynamics to text translation. arXiv preprint arXiv:2309.14030, 2023.
  • (16) Chen Feng, Lu Cao, Di Wu, En Zhang, Ting Wang, Xiaowei Jiang, Heng Ding, Chenhao Zhou, Jinbo Chen, Hui Wu, et al. A high-performance brain-sentence communication designed for logosyllabic language. bioRxiv, pages 2023–11, 2023.
  • (17) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
  • (18) Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015.
  • (19) Weibang Jiang, Liming Zhao, and Bao-Liang Lu. Large brain model for learning generic representations with tremendous EEG data in BCI. In The Twelfth International Conference on Learning Representations, 2024.
  • (20) Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
  • (21) Demetres Kostas, Stephane Aroca-Ouellette, and Frank Rudzicz. BENDR: Using transformers and a contrastive self-supervised learning task to learn from massive amounts of EEG data. Frontiers in Human Neuroscience, 15:653659, 2021.
  • (22) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
  • (23) Guangye Li, Shize Jiang, Sivylla E. Paraskevopoulou, Meng Wang, Yang Xu, Zehan Wu, Liang Chen, Dingguo Zhang, and Gerwin Schalk. Optimal referencing for stereo-electroencephalographic (sEEG) recordings. NeuroImage, 183:327–335, 2018.
  • (24) Yong Liu, Tengge Hu, Haoran Zhang, Haixu Wu, Shiyu Wang, Lintao Ma, and Mingsheng Long. iTransformer: Inverted transformers are effective for time series forecasting. arXiv preprint arXiv:2310.06625, 2023.
  • (25) Yuchen Liu and Ziyu Jia. BSTT: A Bayesian spatial-temporal transformer for sleep staging. In The Eleventh International Conference on Learning Representations, 2022.
  • (26) Sean L. Metzger, Kaylo T. Littlejohn, Alexander B. Silva, David A. Moses, Margaret P. Seaton, Ran Wang, Maximilian E. Dougherty, Jessie R. Liu, Peter Wu, Michael A. Berger, et al. A high-performance neuroprosthesis for speech decoding and avatar control. Nature, 620(7976):1037–1046, 2023.
  • (27) David A. Moses, Sean L. Metzger, Jessie R. Liu, Gopala K. Anumanchipalli, Joseph G. Makin, Pengfei F. Sun, Josh Chartier, Maximilian E. Dougherty, Patricia M. Liu, Gary M. Abrams, et al. Neuroprosthesis for decoding speech in a paralyzed person with anarthria. New England Journal of Medicine, 385(3):217–227, 2021.
  • (28) Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, and Furu Wei. BEiT v2: Masked image modeling with vector-quantized visual tokenizers. 2022.
  • (29) Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018.
  • (30) Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. Advances in Neural Information Processing Systems, 32, 2019.
  • (31) Blake A. Richards, Timothy P. Lillicrap, Philippe Beaudoin, Yoshua Bengio, Rafal Bogacz, Amelia Christensen, Claudia Clopath, Rui Ponte Costa, Archy de Berker, Surya Ganguli, et al. A deep learning framework for neuroscience. Nature Neuroscience, 22(11):1761–1770, 2019.
  • (32) Andrew Saxe, Stephanie Nelli, and Christopher Summerfield. If deep learning is the answer, what is the question? Nature Reviews Neuroscience, 22(1):55–67, 2021.
  • (33) Jingwei Sheng, Li Zheng, Bingjiang Lyu, Zhehang Cen, Lang Qin, Li Hai Tan, Ming-Xiong Huang, Nai Ding, and Jia-Hong Gao. The cortical maps of hierarchical linguistic structures during speech perception. Cerebral Cortex, 29(8):3232–3240, 2019.
  • (34) Yonghao Song, Qingqing Zheng, Bingchuan Liu, and Xiaorong Gao. EEG Conformer: Convolutional transformer for EEG decoding and visualization. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 31:710–719, 2022.
  • (35) Jerry Tang, Amanda LeBel, Shailee Jain, and Alexander G. Huth. Semantic reconstruction of continuous language from non-invasive brain recordings. Nature Neuroscience, 26(5):858–866, 2023.
  • (36) Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017.
  • (37) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  • (38) Christopher Wang, Vighnesh Subramaniam, Adam Uri Yaari, Gabriel Kreiman, Boris Katz, Ignacio Cases, and Andrei Barbu. BrainBERT: Self-supervised representation learning for intracranial recordings. arXiv preprint arXiv:2302.14367, 2023.
  • (39) Haixu Wu, Tengge Hu, Yong Liu, Hang Zhou, Jianmin Wang, and Mingsheng Long. TimesNet: Temporal 2D-variation modeling for general time series analysis. In The Eleventh International Conference on Learning Representations, 2022.
  • (40) Ke Yi, Yansen Wang, Kan Ren, and Dongsheng Li. Learning topology-agnostic EEG representations with geometry-aware modeling. Advances in Neural Information Processing Systems, 36, 2024.
  • (41) Daoze Zhang, Zhizhang Yuan, Yang Yang, Junru Chen, Jingjing Wang, and Yafeng Li. Brant: Foundation model for intracranial neural signal. Advances in Neural Information Processing Systems, 36, 2024.
  • (42) Hui Zheng, Zhongtao Chen, Haiteng Wang, Jianyang Zhou, Lin Zheng, and Yunzhe Liu. Universal sleep decoder: Aligning awake and sleep neural representation across subjects. arXiv preprint arXiv:2309.16457, 2023.

Appendix A Experiment Design

[Figure 6]

Due to the lack of open-source sEEG datasets related to language tasks, we follow the experimental design outlined by Moses et al. [27] to collect a well-annotated Chinese word-reading sEEG dataset, including 12 subjects (9 male, 3 female; aged 15-53, mean 27.8, SD 10.4) with pharmacologically intractable epilepsy.

In the word-reading task, the subject speaks aloud individual words from a 61-word set while we simultaneously record their brain activity (measured by sEEG) and voice. The word set is chosen based on the following criteria:

  • The versatility of the words in generating a range of sentences.

  • The simplicity of using the words to express fundamental caregiving requirements.

  • The diversity of word pronunciations to cover as many Chinese pronunciation combinations as possible.

A list of the words contained in this 61-word set is provided in Table 4.

All data are collected as a series of "blocks" (25 blocks in total), with each block lasting about 10 minutes and consisting of multiple trials. During each block of this task, all words (from the 61-word set) are presented individually twice, leading to a total of 122 trials.

Each trial in a block of this task starts with one word shown on the screen in white text. After 0.5 seconds, the text will turn green and remain on the screen for 2 seconds. This color transition from white to green represents the go cue for each trial, and the subject is instructed to speak the word aloud as soon as the text turns green. Afterward, the text will be replaced with a blank screen with a centered cross. After 0.5 seconds, the task continues to the next trial. The word presentation order is randomized within each task block.

Besides, we also collected non-task recordings of subjects in their daily life. Apart from sleep periods, there are roughly 12 hours of non-task recordings during wakefulness. In summary, for each subject, we collect about 15 hours of sEEG recordings, of which 3 hours are task recordings.

Words | Translations | Words | Translations | Words | Translations
嘴巴 | mouth | 菠萝 | pineapple | 帮助 | help
把 | get | 朋友 | friend | 脸盆 | washbasin
平静 | calm | 漂亮 | pretty | 衣服 | clothes
豆腐 | tofu | 米饭 | rice | 放在 | put on
面条 | noodle | 毛巾 | towel | 关门 | close the door
电脑 | computer | 凳子 | stool | 小刀 | knife
头疼 | headache | 软糖 | gummies | 醋 | vinegar
青菜 | vegetables | 厕所 | toilet | 葱花 | chopped green onion
手机 | cell phone | 篮球 | basketball | 钢琴 | piano
心情 | mood | 丝瓜 | loofah | 蒜泥 | garlic paste
怎样 | how | 香肠 | sausage | 需要 | need
你 | you | 拿 | hold | 橙汁 | orange juice
找 | look for | 猪肉 | pork | 吃 | eat
穿 | wear | 是 | be | 家人 | family
热水 | hot water | 护士 | nurse | 换药 | change dressing
喝 | drink | 口渴 | thirsty | 看 | look
碗 | bowl | 鱼块 | steak | 感觉 | feel
给 | give | 玩 | play | 问题 | problem
外卖 | takeouts | 有 | have | 音乐 | music
预约 | reserve | 汤圆 | sweet dumpling | 愿意 | willing
我 | I | – | – | – | –

Appendix B Details of Baselines

In experiments, we compare our model to the existing supervised or self-supervised methods on brain signals. The details of these baseline models are given here:

  • EEG-Conformer [34]: A supervised model that combines a CNN module and a Transformer module to encapsulate local and global features in a unified EEG classification framework. EEG-Conformer is mainly designed for EEG-based motor imagery tasks. Since the data modalities of EEG and sEEG are similar, and our signals likewise primarily pertain to vocal production, EEG-Conformer is suitable to serve as a baseline for comparison.

  • CNN-BiGRU [27]: A supervised model that combines a CNN module and a Bi-GRU module to capture contextual features from neural signals. This model is mainly designed for ECoG-based vocal-production tasks, similar to ours. Since ECoG and sEEG are both intracranial neural recordings, this model is suitable to serve as a baseline for comparison.

  • BrainBERT [38]: A self-supervised model for sEEG recordings that bridges modern representation learning approaches and neuroscience. BrainBERT builds universal representations based on the superlet spectrogram of a single sEEG channel, without modeling the spatial relationships among channels. Since the downstream tasks for BrainBERT are also related to language decoding (e.g., sentence-onset detection, speech vs. non-speech detection), BrainBERT is suitable to serve as a baseline for comparison.

  • Brant [41]: A self-supervised model for sEEG recordings that captures both long-term temporal dependency and spatial correlation from neural signals. Brant is mainly designed for medical applications, serving as an sEEG foundation model. Although Brant mainly evaluates its performance on low-level modeling tasks [39] (e.g., neural signal forecasting, imputation), it achieves SOTA performance on some high-level modeling tasks (e.g., seizure detection). As a foundation model in the sEEG pre-training field, Brant is suitable to serve as a baseline for comparison.

When evaluating the decoding performance of these baseline models, we follow the same experimental setup as the Du-IN CLS model:

  • For each subject, we split the downstream dataset into training, validation, and testing splits of roughly 80%, 10%, and 10%.

  • Each data sample is 3 seconds long, at the sampling rate specified for each model.

  • The samples in the training set are augmented following the pipeline defined in Appendix D.

For the self-supervised methods, the pre-training setup follows the original setup of each model:

  • For the BrainBERT model, we use around 180 hours of sEEG recordings from all 12 subjects for pre-training. This pre-training dataset is larger than the one (approximately 45 hours) used in the original paper.

  • For the Brant model, we also use all sEEG recordings from 12 subjects to pre-train it. While the total pre-training dataset is smaller than the one (around 2700 hours) used in the original paper, the number of subjects (i.e., the number of sEEG location configurations) is greater than in the original paper.

Appendix C Model Details

C.1 Du-IN VQ-VAE

The architecture of the Du-IN VQ-VAE model contains three parts: (1) Neural Tokenizer, (2) Vector Quantizer, and (3) Neural Regressor. The overall architecture of the "Neural Tokenizer" is shown in Figure 2. The "Vector Quantizer" is implemented similarly to that in LaBraM [19]. The "Neural Regressor" contains:

  • Transformer Decoder: A stack of Transformer layers.

  • Time Regression Head: A stack of 1D Transposed Convolution layers and one linear projection layer.

The hyperparameters for Du-IN VQ-VAE training are shown in Table 5.

Table 5: Hyperparameters for Du-IN VQ-VAE training.
Module | Sub-Module | Name | Value
Neural Tokenizer (Neural Transformer) | Spatial Encoder | Linear Projection | 10 → 16
 | | # of Input Channels | {16, 128, 128}
 | | # of Output Channels | {128, 128, 16}
 | | Kernel Size | {19, 3, 3}
 | | Stride | {10, 1, 1}
 | | Padding | {9, 1, 1}
 | Transformer Encoder | # of Transformer Layers | 8
 | | Hidden Size | 160
 | | MLP Size | 320
 | | MLP Dropout Ratio | {0.2, 0.}
 | | # of Attention Heads | 8
 | | Attention Head Size | 64
 | | Attention Dropout Ratio | 0.2
Vector Quantizer | - | Codex Size | 2048 × 64
 | | Embedding-to-Codex Projection | 160 → 160 (Tanh) → 64
 | | Codex-to-Embedding Projection | 64 → 160
Neural Regressor | Transformer Decoder | # of Transformer Layers | 4
 | | Hidden Size | 160
 | | MLP Size | 320
 | | MLP Dropout Ratio | {0.2, 0.}
 | | # of Attention Heads | 8
 | | Attention Head Size | 64
 | | Attention Dropout Ratio | 0.2
 | Time Regression Head | # of Input Channels | {160, 128, 128, 128, 128}
 | | # of Output Channels | {128, 128, 128, 128, 16}
 | | Kernel Size | {3, 3, 10, 9, 19}
 | | Stride | {1, 1, 10, 1, 10}
 | | Padding | -
 | | Output Padding | -
 | | Linear Projection | 16 → 10
Optimizer | - | Batch Size | 64
 | | Maximum Learning Rate | 3e-4
 | | Minimum Learning Rate | 5e-5
 | | Learning Rate Scheduler | Cosine
 | | Optimizer Type | AdamW
 | | Adam β | (0.9, 0.99)
 | | Weight Decay | 0.01
 | | Total Epochs | 400
 | | Warm-up Epochs | 40
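
As a rough illustration of how the "Time Regression Head" could be assembled from the values in Table 5, here is a minimal PyTorch sketch (not the authors' released implementation). Padding and output padding are left at framework defaults because Table 5 does not specify them, and the final 16 → 10 projection is assumed to map back to the 10 selected input channels.

```python
import torch.nn as nn

class TimeRegressionHead(nn.Module):
    """Sketch of the time regression head, using the channel/kernel/stride values from Table 5."""

    def __init__(self):
        super().__init__()
        in_chs  = [160, 128, 128, 128, 128]
        out_chs = [128, 128, 128, 128, 16]
        kernels = [3, 3, 10, 9, 19]
        strides = [1, 1, 10, 1, 10]
        # Padding / output padding left at defaults (Table 5 does not specify them).
        self.deconv = nn.Sequential(*[
            nn.ConvTranspose1d(i, o, kernel_size=k, stride=s)
            for i, o, k, s in zip(in_chs, out_chs, kernels, strides)
        ])
        # Final linear projection, 16 -> 10 (assumed to map back to the 10 input channels).
        self.proj = nn.Linear(16, 10)

    def forward(self, x):
        # x: (batch, num_tokens, 160) embeddings from the Transformer decoder.
        x = self.deconv(x.transpose(1, 2))    # -> (batch, 16, num_samples)
        return self.proj(x.transpose(1, 2))   # -> (batch, num_samples, 10)
```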

C.2 Du-IN MAE

The architecture of the Du-IN MAE model contains two parts: (1) Neural Encoder, and (2) Token Prediction Head. The overall architecture of the "Neural Encoder" is shown in Figure 2. The hyperparameters of "Neural Encoder" are the same as those of "Neural Tokenizer" in Du-IN VQ-VAE. It’s worth noting that when training Du-IN MAE, the weights of the "Neural Encoder" are randomly initialized, instead of loaded from the pre-trained Du-IN VQ-VAE model. The hyperparameters for Du-IN MAE training are shown in Table 6.

Table 6: Hyperparameters for Du-IN MAE training.
Module | Sub-Module | Name | Value
Token Prediction Head | - | Linear Projection | 160 → 2048
Optimizer | - | Batch Size | 64
 | | Maximum Learning Rate | 3e-4
 | | Minimum Learning Rate | 5e-5
 | | Learning Rate Scheduler | Cosine
 | | Optimizer Type | AdamW
 | | Adam β | (0.9, 0.99)
 | | Weight Decay | 0.05
 | | Total Epochs | 400
 | | Warm-up Epochs | 40
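
To make the mask-modeling objective concrete, the sketch below shows one plausible form of the token-prediction loss, assuming (as is common in mask-modeling setups) that the cross-entropy is computed only at masked positions against the discrete codes produced by the pre-trained Du-IN VQ-VAE; `encoder`, `head`, and the masking itself are stand-ins, not the released implementation.

```python
import torch.nn.functional as F

def mae_token_loss(encoder, head, x_masked, target_tokens, mask):
    """Masked-token prediction loss (sketch).

    encoder:       Du-IN neural encoder; maps masked sEEG to (batch, tokens, 160).
    head:          token prediction head, e.g. a Linear(160, 2048) layer as in Table 6.
    target_tokens: (batch, tokens) discrete codes from the pre-trained Du-IN VQ-VAE.
    mask:          (batch, tokens) boolean tensor, True where the input was masked.
    """
    logits = head(encoder(x_masked))                            # (batch, tokens, 2048)
    return F.cross_entropy(logits[mask], target_tokens[mask])   # loss on masked positions only
```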

C.3 Du-IN CLS

The architecture of the Du-IN CLS model contains two parts: (1) Neural Encoder, and (2) Label Prediction Head. The overall architecture of the "Neural Encoder" is shown in Figure 2. The hyperparameters of "Neural Encoder" are the same as those of "Neural Tokenizer" in Du-IN VQ-VAE. It’s worth noting that the "Neural Encoder" weights in Du-IN CLS can be loaded from either the pre-trained Du-IN MAE or the pre-trained Du-IN VQ-VAE. In the ablation experiments shown in Table 2, our models have different suffixes:

  • Du-IN: The original Du-IN CLS model. All weights of this model are randomly initialized.

  • Du-IN (vqvae+vq): The weights of the "Neural Encoder" in the Du-IN CLS model are loaded from the pre-trained Du-IN VQ-VAE. When fine-tuning it on the downstream task, the "Vector Quantizer" from the pre-trained Du-IN VQ-VAE is inserted between the "Neural Encoder" and the "Label Prediction Head". This is the same operation as in DeWave [15].

  • Du-IN (vqvae): The weights of the "Neural Encoder" in the Du-IN CLS model are loaded from the pre-trained Du-IN VQ-VAE. This is the same operation as in EEGFormer [9].

  • Du-IN (mae): The weights of the "Neural Encoder" in the Du-IN CLS model are loaded from the pre-trained Du-IN MAE. This is the same operation as in LaBraM [19].

The "Label Prediction Head" is an MLP with one hidden layer: it flattens the output embedding sequence from the upstream encoder and maps the flattened feature embedding to the final prediction. The hyperparameters for Du-IN CLS training are shown in Table 7.

Table 7: Hyperparameters for Du-IN CLS training.
Module | Sub-Module | Name | Value
Label Prediction Head | - | Flatten | -
 | | Linear Projection | 30 × 160 → 128 (ReLU) → 61
Optimizer | - | Batch Size | 32
 | | Maximum Learning Rate | 2e-4
 | | Minimum Learning Rate | 5e-6
 | | Learning Rate Scheduler | Cosine
 | | Optimizer Type | AdamW
 | | Adam β | (0.9, 0.99)
 | | Weight Decay | 0.05
 | | Total Epochs | 200
 | | Warm-up Epochs | 20
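
For clarity, a minimal PyTorch sketch of the label prediction head, following the dimensions in Table 7 (30 tokens of 160 dimensions, one hidden layer of 128 units, 61 output classes); this is an illustrative reconstruction, not the exact released code.

```python
import torch.nn as nn

# Sketch of the label prediction head (Table 7): flatten the 30 x 160 token
# embeddings, then apply an MLP with one hidden layer and 61 output classes.
label_head = nn.Sequential(
    nn.Flatten(),              # (batch, 30, 160) -> (batch, 4800)
    nn.Linear(30 * 160, 128),
    nn.ReLU(),
    nn.Linear(128, 61),        # logits over the 61-word vocabulary
)
```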

Appendix D Data Augmentation

To enhance the robustness of the learned representations during both the pre-training and fine-tuning stages, we apply data augmentation to both datasets.

Pre-training Dataset.

In our implementation, we segment sEEG recordings into 8-second samples with a 4-second overlap. When fetching a sample, we randomly select a starting point between 0 and 4 seconds, then extract a 4-second sample beginning from that point.
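
A minimal NumPy sketch of this jittering, assuming an integer sampling rate `sfreq` and a pre-cut 8-second segment; the actual data-loading code may differ.

```python
import numpy as np

def crop_pretrain_sample(segment, sfreq):
    """Randomly crop a 4-second window from a stored 8-second segment.

    segment: array of shape (n_channels, 8 * sfreq), sfreq in Hz (integer).
    """
    start = np.random.uniform(0.0, 4.0)          # random onset within the first 4 s
    i = int(round(start * sfreq))
    return segment[:, i:i + int(4 * sfreq)]
```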

Downstream Dataset.

Since a trial lasts 3 seconds, employing the jittering described above would blend in information from neighboring trials. In our implementation, we segment sEEG recordings into 3-second samples. When fetching a sample, we randomly choose a shift step between 0 and 0.3 seconds, then shift the sample either to the left or right and pad the gap with zeros.
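
A minimal NumPy sketch of this shift-and-pad augmentation, again assuming an integer sampling rate; the shift granularity in the released code may differ.

```python
import numpy as np

def shift_trial_sample(trial, sfreq, max_shift_s=0.3):
    """Randomly shift a 3-second trial left or right and zero-pad the gap.

    trial: array of shape (n_channels, 3 * sfreq), sfreq in Hz (integer).
    """
    shift = np.random.randint(0, int(max_shift_s * sfreq) + 1)
    if shift == 0:
        return trial.copy()
    out = np.zeros_like(trial)
    if np.random.rand() < 0.5:          # shift right: zeros at the beginning
        out[:, shift:] = trial[:, :-shift]
    else:                               # shift left: zeros at the end
        out[:, :-shift] = trial[:, shift:]
    return out
```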

Appendix E Du-IN Pre-training Analysis

The pre-training of Du-IN can be interpreted as the training of a variational autoencoder [20, 3]. Let $x$ denote the original sEEG signal, $\tilde{x}$ the sEEG signal corrupted by masking, and $z$ the neural tokens. Consider the evidence lower bound (ELBO) of the log-likelihood $p(x|\tilde{x})$, i.e., recovering the original sEEG signal from its corrupted version:

\[
\sum_{(x_i,\tilde{x}_i)\in\mathcal{D}} \log p(x_i \mid \tilde{x}_i) \;\geq\; \sum_{(x_i,\tilde{x}_i)\in\mathcal{D}} \Big( \underbrace{\mathbb{E}_{z_i \sim q_\phi(\mathbf{z} \mid x_i)}\big[\log p_\psi(x_i \mid z_i)\big]}_{\text{Neural Token Reconstruction}} - D_{\mathrm{KL}}\big[q_\phi(\mathbf{z} \mid x_i),\, p_\theta(\mathbf{z} \mid \tilde{x}_i)\big] \Big), \tag{10}
\]

where (1) $q_\phi(\mathbf{z}|x)$ denotes the neural tokenizer that obtains the neural tokens; (2) $p_\psi(x|z)$ decodes the original sEEG signal given the input neural tokens; (3) $p_\theta(\mathbf{z}|\tilde{x})$ recovers the neural tokens from the masked sEEG signal, which is our Du-IN pre-training task.

The whole framework is optimized through a two-stage procedure, as in [36, 30]. In the first stage, we train the neural tokenizer as a discrete variational autoencoder by minimizing the reconstruction loss $-\mathbb{E}_{z_i \sim q_\phi(\mathbf{z}|x_i)} \log p_\psi(\tilde{x}_i|z_i)$ with a uniform prior. In the second stage, we keep $q_\phi$ and $p_\psi$ fixed and learn the prior $p_\theta$ by minimizing the loss $D_{\mathrm{KL}}$. For simplicity, $q_\phi(\mathbf{z}|x_i)$ is defined as a one-point distribution over the most likely neural tokens $\hat{z}_i = \arg\max_z q_\phi(\mathbf{z}|x_i)$. Consequently, we can rewrite Equation 10 as

\[
\sum_{(x_i,\tilde{x}_i)\in\mathcal{D}} \log p(x_i \mid \tilde{x}_i) \;\geq\; \sum_{(x_i,\tilde{x}_i)\in\mathcal{D}} \Big( \underbrace{\mathbb{E}_{z_i \sim q_\phi(\mathbf{z} \mid x_i)}\big[\log p_\psi(\tilde{x}_i \mid z_i)\big]}_{\text{Neural Token Reconstruction}} + \underbrace{\log p_\theta(\hat{z}_i \mid \tilde{x}_i)}_{\text{Masked sEEG Modeling}} \Big), \tag{11}
\]

where the first term is the objective for vector-quantized neural signal regression in the first stage (i.e., the Du-IN VQ-VAE model), and the second term is the objective for Du-IN pre-training in the second stage (i.e., the Du-IN MAE model).

Appendix F Visualization of Vector-Quantized sEEG Regression

We further visualize how the sEEG signals are reconstructed. As depicted in Figure 7, although some details are missing, the overall trend of the signals is reconstructed well. Meanwhile, the reconstruction loss decreases stably during training, which indicates that the discrete codebook does learn high-level information from the sEEG signals.

[Figure 7: visualization of vector-quantized sEEG signal reconstruction.]

Appendix G Visualization of Mask sEEG Modeling

Figure 8 shows the convergence curves of the total pre-training loss and the masked sEEG modeling accuracy of the Du-IN MAE model. We observe a stable decrease in the mask modeling loss, and the mask modeling accuracy reaches about 20%, which is similar to [19].

[Figure 8: convergence curves of the Du-IN MAE pre-training loss and masked sEEG modeling accuracy.]

Appendix H Channel Contribution Analysis

For each subject, after training the Du-IN model (with random initialization) on the downstream dataset, we utilize the weights $W \in \mathbb{R}^{C \times D}$ of the linear projection in the spatial encoder to calculate the contribution scores $\mathcal{S}$ of the channels:

\[
\mathcal{S} = \{ s_i \mid i = 1, \dots, C \}, \qquad s_i = \frac{1}{D} \sum_{j=1}^{D} \lvert W_{ij} \rvert, \tag{12}
\]

where $C$ is the number of channels, $D$ is the dimension of the projected embedding, and $\lvert \cdot \rvert$ takes the absolute value. Then, we normalize $\mathcal{S}$ by its maximum value so that the scores fall within the $[0, 1]$ range. Finally, given the variability in model performance across subjects, we further scale the channel contribution scores by the decoding performance of each subject, i.e., $\mathcal{S} = \{ s_i \cdot p \mid i = 1, \dots, C \}$, where $p$ denotes the decoding performance of that subject.
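
A minimal NumPy sketch of this computation (Equation 12 plus the normalization and performance scaling described above):

```python
import numpy as np

def channel_contribution(W, decoding_acc):
    """Channel contribution scores from the spatial-encoder projection weights (Eq. 12).

    W: (C, D) weight matrix of the linear projection in the spatial encoder.
    decoding_acc: the subject's decoding performance, used to scale the scores.
    """
    s = np.abs(W).mean(axis=1)   # s_i = (1 / D) * sum_j |W_ij|
    s = s / s.max()              # normalize to [0, 1] by the maximum value
    return s * decoding_acc      # weight by the subject's decoding performance
```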

After calculating the channel contribution scores of all subjects, we project them onto the standard brain template according to the MNI (Montreal Neurological Institute) locations of the channels, using Nilearn 0.9.2. Since the electrodes are sparsely distributed within the brain, we use SciPy 1.8.1 to interpolate and smooth the channel contribution matrix, and use Nilearn to plot the channel contribution map shown in Figure 4(a).

With the sorted channels within each subject, we evaluate the effect of the number of channels on decoding performance. For each subject, we evaluate the Du-IN model with {5, 10, 15, 20, 30, 60} channels (sorted by channel contribution scores), and the average performance (across subjects) is shown in Figure 4(b).

Appendix I Subject-Wise Evaluation

The detailed performance of the different methods for each subject is provided in Table 8, Table 9, and Table 10. For model comparison, we report the average and standard deviation (within each subject) over six different random seeds to obtain comparable results. "Std" means standard deviation.

Table 8: Accuracy (%) ± Std (%), subj-01 to subj-04.
Methods | Pre-trained | subj-01 | subj-02 | subj-03 | subj-04
EEG-Conformer [34] | - | 58.41 ± 1.03 | 69.82 ± 1.22 | 19.50 ± 1.71 | 49.65 ± 2.38
CNN-BiGRU [27] | - | 46.46 ± 4.03 | 68.06 ± 1.56 | 4.35 ± 0.44 | 17.68 ± 3.88
BrainBERT [38] | ✓ | 6.76 ± 0.64 | 25.64 ± 1.23 | 2.97 ± 0.66 | 5.09 ± 0.44
Brant [41] | ✓ | 7.47 ± 2.83 | 54.26 ± 1.63 | 3.34 ± 0.38 | 4.15 ± 1.35
Du-IN | - | 71.25 ± 1.44 | 77.99 ± 0.87 | 23.04 ± 4.76 | 59.91 ± 4.58
Du-IN (vqvae+vq) | ✓ | 50.15 ± 3.80 | 62.79 ± 4.67 | 20.72 ± 2.15 | 48.24 ± 2.65
Du-IN (vqvae) | ✓ | 72.36 ± 1.55 | 79.16 ± 1.12 | 29.21 ± 2.38 | 63.83 ± 1.83
Du-IN (mae) | ✓ | 78.60 ± 0.79 | 83.61 ± 0.38 | 38.80 ± 2.52 | 70.98 ± 0.81

Table 9: Accuracy (%) ± Std (%), subj-05 to subj-08.
Methods | Pre-trained | subj-05 | subj-06 | subj-07 | subj-08
EEG-Conformer [34] | - | 65.44 ± 1.31 | 31.06 ± 2.58 | 47.89 ± 1.86 | 42.12 ± 2.08
CNN-BiGRU [27] | - | 51.26 ± 4.93 | 31.52 ± 1.48 | 47.75 ± 1.12 | 24.64 ± 4.44
BrainBERT [38] | ✓ | 11.28 ± 1.17 | 3.00 ± 0.47 | 5.31 ± 0.55 | 5.22 ± 0.76
Brant [41] | ✓ | 28.83 ± 1.93 | 5.28 ± 1.48 | 9.70 ± 1.85 | 8.93 ± 1.39
Du-IN | - | 77.60 ± 1.20 | 41.91 ± 1.80 | 59.63 ± 2.20 | 52.35 ± 2.18
Du-IN (vqvae+vq) | ✓ | 63.46 ± 2.28 | 34.84 ± 1.98 | 45.20 ± 2.44 | 40.14 ± 1.77
Du-IN (vqvae) | ✓ | 78.56 ± 1.24 | 43.29 ± 1.67 | 62.29 ± 1.49 | 54.10 ± 1.34
Du-IN (mae) | ✓ | 81.56 ± 1.11 | 46.90 ± 1.02 | 65.45 ± 1.74 | 59.09 ± 0.98

Table 10: Accuracy (%) ± Std (%), subj-09 to subj-12.
Methods | Pre-trained | subj-09 | subj-10 | subj-11 | subj-12
EEG-Conformer [34] | - | 56.51 ± 1.98 | 22.22 ± 1.07 | 57.10 ± 2.03 | 29.87 ± 1.44
CNN-BiGRU [27] | - | 44.03 ± 5.88 | 7.11 ± 0.71 | 28.44 ± 3.42 | 13.17 ± 3.41
BrainBERT [38] | ✓ | 7.20 ± 1.37 | 2.49 ± 0.43 | 10.60 ± 1.22 | 4.41 ± 0.88
Brant [41] | ✓ | 6.46 ± 1.66 | 3.00 ± 0.31 | 9.82 ± 1.71 | 7.82 ± 1.66
Du-IN | - | 66.39 ± 0.47 | 27.07 ± 2.24 | 73.56 ± 1.09 | 44.76 ± 3.74
Du-IN (vqvae+vq) | ✓ | 60.06 ± 1.61 | 22.05 ± 1.76 | 50.31 ± 4.69 | 32.06 ± 3.28
Du-IN (vqvae) | ✓ | 67.18 ± 1.22 | 31.06 ± 1.59 | 72.41 ± 1.98 | 45.38 ± 2.26
Du-IN (mae) | ✓ | 69.18 ± 1.96 | 34.23 ± 1.17 | 75.52 ± 1.27 | 48.54 ± 0.56

Appendix J Subject-Wise Electrode Locations

We provide detailed information on the locations of the implanted sEEG electrodes for each subject. Red channels are the top 10 channels (selected through channel contribution analysis) for both pre-training and downstream evaluation, as described in Section 4.3. As the majority of subjects have sEEG electrodes implanted on only one side of their brains to locate the source of epilepsy, we provide side views of either the left or right brain areas here.

[Figures: side views of the implanted sEEG electrode locations for each subject, with the top 10 contributing channels marked in red.]

NeurIPS Paper Checklist

  1. Claims

Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

Answer: [Yes]

Justification: Three of the four contributions mentioned at the end of the "Introduction" section are explicitly included. The contribution related to neuroscience-inspired analysis is summarized as "inspired by neuroscience findings" at the end of the Abstract.

  2. Limitations

Question: Does the paper discuss the limitations of the work performed by the authors?

Answer: [Yes]

Justification: We provide a separate "Limitations" section; see Section 5 for more details.

  3. Theory Assumptions and Proofs

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Answer: [N/A]

Justification: The paper does not include theoretical results.

  4. Experimental Result Reproducibility

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?

Answer: [Yes]

Justification: We provide detailed information related to our model and baselines in Appendix C and Appendix B, respectively. Besides, we provide the code and a demo dataset in Section 6.

  5. Open access to data and code

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?

Answer: [Yes]

Justification: We provide the code and a demo dataset in Section 6. Due to the lack of open-source sEEG datasets related to language, we collected a well-annotated Chinese word-reading sEEG dataset and evaluated our model on it. However, respecting the efforts of the data collectors, we only provide the data of some subjects to reproduce the experimental results in the main text. The whole dataset will be made publicly available to ensure the reproducibility of this work.

  6. Experimental Setting/Details

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?

Answer: [Yes]

Justification: We provide detailed information related to our model and baselines in Appendix C and Appendix B, respectively.

  7. Experiment Statistical Significance

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?

Answer: [Yes]

Justification: For the main results, we report the average and standard error (across all subjects) over six random seeds. For the detailed subject-wise evaluation, we report the average and standard deviation (within each subject) over six random seeds.

  8. Experiments Compute Resources

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?

Answer: [Yes]

Justification: Detailed information related to the training process is provided in Section 4.2.

  9. Code Of Ethics

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?

Answer: [Yes]

Justification: The research conducted in the paper conforms, in every respect, with the NeurIPS Code of Ethics.

  10. Broader Impacts

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?

Answer: [No]

Justification: This work aims to explore the feasibility of decoding speech from intracranial neural signals, which mainly has positive impacts on society.

  11. Safeguards

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?

Answer: [No]

Justification: The paper poses no such risks.

  12. Licenses for existing assets

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?

Answer: [Yes]

Justification: The creators or original owners of the assets (e.g., code, data, models) used in the paper are properly credited, and the licenses and terms of use are explicitly mentioned and properly respected.

  13. New Assets

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?

Answer: [N/A]

Justification: The paper does not release new assets.

  14. Crowdsourcing and Research with Human Subjects

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?

Answer: [Yes]

Justification: The ethics statements are provided in Section 6.

  15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?

Answer: [Yes]

Justification: The ethics statements are provided in Section 6.
