[Akkermansia muciniphila] Transcriptomics and metabolomics reveal the adaption of Akkermansia muciniphila to high mucin by regulating energy homeostasis 논문 데이터 분석(1)

Data_Analysis

[Akkermansia muciniphila] Transcriptomics and metabolomics reveal the adaption of Akkermansia muciniphila to high mucin by regulating energy homeostasis 논문 데이터 분석(1)

rudgh99_algo 2024. 11. 22. 12:54

※ Transcriptomics and metabolomics reveal the adaption of Akkermansia muciniphila to high mucin by regulating energy homeostasis 논문에서 쓰인 transcriptomics data를 직접 다운로드하여 분석하는 프로젝트를 진행하려 한다.

0. Reference Genome, Transcriptomics data 찾기

이 paper에서 "RNA isolation and analysis" 페이지에 Ref genome과 Transcriptomics data의 accession number를 제공해주고 있다. 정보를 정리해 보자면 다음과 같다.

Raw sequencing data (transcriptomics data) : SRA accession no. PRJNA559704

genome sequence of A.muciniphila : GenBank accession number CP042830

이제 accession number를 입력해서 다운로드하면 된다.

0_1. transcriptomics data

https://www.ncbi.nlm.nih.gov/sra/?term=PRJNA559704

PRJNA559704 - SRA - NCBI

www.ncbi.nlm.nih.gov

총 12개의 sample이 있다. BHI(control) , 0.05% mucin, 0.2% mucin, 0.5% mucin에서 배양된 sample이 각각 3개씩 들어있다. 우선은 , BHI (control) 1개만 다운로드하여서 사용해 보겠다.

https://trace.ncbi.nlm.nih.gov/Traces/? view=run_browser&acc=SRR10803469&display=download

SRA Archive: NCBI

Updated: Thu Nov 7 11:16:25 EST 2024

trace.ncbi.nlm.nih.gov

여기서 fastq 파일형식으로 filter옵션을 체크하고 다운로드하였다.

0_2. Reference genome

Reference genome은 Genbank에 들어가서, accession number를 입력하면 된다.

https://www.ncbi.nlm.nih.gov/genbank/

GenBank Overview

GenBank Overview What is GenBank? GenBank ® is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research, 2013 Jan;41(D1):D36-42). GenBank is part of the International Nucleotide Sequence Da

www.ncbi.nlm.nih.gov

그러면 A.muciniphila에 접근할 수 있다.

https://www.ncbi.nlm.nih.gov/datasets/genome/?taxon=239935

Genome

Download a genome data package including genome, transcript and protein sequence, annotation and a data report

www.ncbi.nlm.nih.gov

이 페이지에 접속하면 다양한 strain 결과가 나오는데, paper에서는 strain DSM 22959를 사용했다고 하니, 이 strain을 다운로드하여 준다.

https://www.ncbi.nlm.nih.gov/nuccore/1727622584

Akkermansia muciniphila strain DSM 22959 chromosome, complete genome - Nucleotide - NCBI

Due to the large size of this record, sequence and annotated features are not shown. Use the "Customize view" panel to change the display. no features Feature First Previous Next Last Details

www.ncbi.nlm.nih.gov

여기서 fasta 형식으로 다운로드하여 준다.

1. 필요한 package 설치

이번 포스팅에서는 mapping 결과인 sam file을 생성하고, sam file을 bam file로 바꿔서, samtools로 mapping 된 곳을 보는 것까지 해보겠다.

따라서, 필요한 package는 hisat2, samtools가 필요하다. 그전에 내가 사용하는 환경은 Ubuntu 22.04 OS에서 anaconda3을 설치해서 이용하고 있다. conda python version은 3.9로 이용했다.

hisat2 : hisat2는 reference genome에 리드들을 mapping 해주는 software다.

conda install -c bioconda hisat2

samtools : samtools는 sam과 bam 파일을 다룰 때 사용하는 software다.

conda install -c bioconda samtools

2. Reference genome index 생성

hisat2를 이용해서 ref genome에 mapping 속도를 높이기 위해 인덱싱을 해줄 것이다.

hisat2-build reference genome fasta file 경로를 여기에 적자. \
             여기에 인덱싱 끝난 파일이 어디에 위치할지 적자.

indexing 과정이 끝나면 위의 사진에서와 같이 index 파일들이 생성된다.

3. Mapping을 시행

mapping을 시행하기 위해서는 fastq파일이 필요하다. 이 fastaq파일은 read들을 기록해 놓은 파일이다. 0단계에서 fastq 파일을 다운로드하면 pair end로 sequencing 하였기 때문에 2개의 파일이 있어야 한다. 하나는 forward read 나머지 하나는 reverse read다. 하지만, 한 개의 파일에 합쳐져 있어서 따로 forward read file과 reverse read file로 분리하는 과정이 필요했다.

아래 source code를 이용해 분리를 진행했다.

import gzip
def split_fastq_paired(input_file, output_r1, output_r2):
    with gzip.open(input_file, 'rt') as infile, \
         gzip.open(output_r1, 'wt') as r1, \
         gzip.open(output_r2, 'wt') as r2: 

        while True:
            # Forward reads 기록
            header_forward = infile.readline() 
            if not header_forward:
                break 
            sequence_forward = infile.readline() 
            plus_forward = infile.readline() 
            quality_forward = infile.readline()

            # Reverse reads 기록 
            header_reverse = infile.readline()
            if not header_reverse:
                break
            sequence_reverse = infile.readline() 
            plus_reverse = infile.readline() 
            quality_reverse = infile.readline() 

            r1.write(f"{header_forward}{sequence_forward}{plus_forward}{quality_forward}")
            r2.write(f"{header_reverse}{sequence_reverse}{plus_reverse}{quality_reverse}")

코드를 간단히 설명하자면, gzip을 이용해 파일을 open 하고, 4줄씩 잘라서 forward file과 reverse file에 기록했다.

따라서, 두 가지 파일을 생성했다.

이제 두 파일을 이용해서 mapping을 진행하면 된다.

hisat2 -x ~A.muciniphila_index \
       -1 forward_file 경로 \
       -2 reverse_file 경로 \
       -S 저장할 directory 위치 경로

이제 sam 파일이 생성되었다.

3. sam file --> bam file

sam file을 용량이 크기 때문에 bam file로 변환해서 사용하는 것이 좋다. 따라서, sam file을 bam file로 바꾸자.

samtools view -bS sam file 경로 \
  > 생성 될 bam file 경로

4. bam file sorting 시행

현재 bam file은 read들이 무작위로 나열되어 있다. 분석을 하기 위해서는 read들이 위치별로 정렬될 필요가 있다. 따라서, sorting을 실시한다.

samtools sort bam 파일 경로 \
  -o sorted bam file 저장 될 위치

5. bam file indexing

sorted bam file을 만든 후, 이번에도 indexing 과정을 거쳐 추후 분석에 용이하게 만들자.

samtools index sorted bam 파일 경로

6. samtools tview를 통해 mapping을 직접 보기

이제 samtools tview를 통해 mapping 이 어떻게 되었는지 확인해 보자.

samtools tview -p CP042830.1:1000 sorted bam 파일 경로 \
              reference genome fasta 파일 경로

output :

기호 해석 법 :

기호의미

. 또는 ,	read가 ref genome과 일치. (.은 Forward Strand, ,는 Reverse Strand)
> 또는 <	read의 방향. >는 Forward Strand, <는 Reverse Strand.
*	삽입/삭제(Indel). 읽기에서 삽입/삭제가 발생한 위치.
G, A, T, C	read와 ref genome가 불일치하는 경우, read에서 관찰된 염기를 나타냄 (Mismatch).
공백	해당 위치에 mapping된 read가 없음.