[B.plantarii] mutant vs wild type

Data_Analysis

[B.plantarii] mutant vs wild type

rudgh99_algo 2024. 11. 13. 20:19

0. wild type과 mutant의 전사체 데이터를 비교해 보는 프로젝트를 진행하였다.

따라서, 필요한 데이터는 각 type의 전사체 데이터, annotation 파일 그리고 참조유전체 파일이다.

1. fastq 파일을 bam 파일로 바꾸자.

fastq는 header 부분과 quality score 가 들어있는 파일이다.

이제 이 파일을 bam 파일로 바꿔야 된다. 그러기 위해서는 필요한 툴들부터 설치해 보자. (참고로, 이 프로젝트는 구글 코랩에서 진행하였다.)

# hisat2와 samtools 설치
!apt-get update
!apt-get install -y hisat2 samtools

위 코드를 실행하고 난 후,

# 참조 유전체 경로 설정
reference_path = '/content/drive/MyDrive/plantarii_wgs/ncbi_dataset/data/GCF_030644525.1/GCF_030644525.1_ASM3064452v1_genomic.fna'

# hisat2 인덱스 생성 (reference.fasta 파일을 바탕으로)
!hisat2-build {reference_path} /content/drive/MyDrive/B_plantarii/ref/reference_index

참조유전체 경로를 설정해 주고, hisat2를 이용해 ref 유전체 파일에 인덱싱 작업을 해준다.

# 각 FASTQ 파일 경로 설정
fastq_files = [
    ('/content/drive/MyDrive/B_plantarii_Transcriptomics_WT_aroA_mutant/rawDataAroaPlantariiMutant/BM-1_1.fastq.gz', '/content/drive/MyDrive/B_plantarii_Transcriptomics_WT_aroA_mutant/rawDataAroaPlantariiMutant/BM-1_2.fastq.gz'),
    ('/content/drive/MyDrive/B_plantarii_Transcriptomics_WT_aroA_mutant/rawDataAroaPlantariiMutant/BM-2_1.fastq.gz', '/content/drive/MyDrive/B_plantarii_Transcriptomics_WT_aroA_mutant/rawDataAroaPlantariiMutant/BM-2_2.fastq.gz'),
    ('/content/drive/MyDrive/B_plantarii_Transcriptomics_WT_aroA_mutant/rawDataAroaPlantariiMutant/BM-3_1.fastq.gz', '/content/drive/MyDrive/B_plantarii_Transcriptomics_WT_aroA_mutant/rawDataAroaPlantariiMutant/BM-3_2.fastq.gz')
]

# 각 쌍에 대해 BAM 파일 생성
for i, (fastq1, fastq2) in enumerate(fastq_files, start=1):
    output_bam_path = f'/content/drive/MyDrive/B_plantarii/output/BM-{i}_output.bam'
    !hisat2 -x /content/drive/MyDrive/B_plantarii/ref/reference_index -1 {fastq1} -2 {fastq2} | samtools view -Sb - > {output_bam_path}

이후, fastq 파일을 참조유전체를 사용해 mapping을 하고, bam 파일로 변환한다.

# 각 BAM 파일 정렬
for i in range(1, 4):
    input_bam_path = f'/content/drive/MyDrive/B_plantarii/output/BM-{i}_output.bam'
    sorted_bam_path = f'/content/drive/MyDrive/B_plantarii/output/BM-{i}_output_sorted.bam'
    !samtools sort {input_bam_path} -o {sorted_bam_path}

위에서 만들어진 bam 파일을 정렬해 준다.

# 각 정렬된 BAM 파일에 색인 생성
for i in range(1, 4):
    sorted_bam_path = f'/content/drive/MyDrive/B_plantarii/output/BM-{i}_output_sorted.bam'
    !samtools index {sorted_bam_path}

마지막으로 인덱싱을 해준다. bai 파일이 생성될 건데, 아직 이용하지는 않았다.

위의 코드를 순차적으로 실행하면 된다. 하지만, 위의 코드는 mutant type만 다루는 코드이기 때문에 , wild type 파일도 위의 코드에 추가해서 실행하면 된다.

2. annotation file을 전처리

B.plantarii KCC 같은 경우는 chr1, chr2, plasmid1으로 유전체가 구성되어 있다. 따라서, 각각의 annotation file이 존재한다.

annotation 파일 3개를 하나의 파일로 합쳐준다.

# GFF 파일 병합
!cat /content/drive/MyDrive/B_plantarii/gff/plantarii_chr1.gff3 \
     /content/drive/MyDrive/B_plantarii/gff/plantarii_chr2.gff3 \
     /content/drive/MyDrive/B_plantarii/gff/p1.gff3 > /content/drive/MyDrive/B_plantarii/gff/combined_annotations.gff3

아래의 예시코드를 참고해서 합치면 된다. 아웃풋으로 combined_annotation.gff3가 생성된다.

# gffread 설치
!apt-get install -y gffread

# GFF3 파일을 GTF로 변환
!gffread /content/drive/MyDrive/B_plantarii/gff/combined_annotations.gff3 -T -o /content/drive/MyDrive/B_plantarii/gff/combined_annotations.gtf

이제 위 코드를 실행해서, gff3 파일에서 gtf 파일형식으로 변환한다.

3. DEG를 찾아보자

# 셸 명령어 셀에 입력하고 실행
# 업데이트 및 필수 도구 설치
!apt-get update
!apt-get install -y samtools subread

# R 설치
!apt-get install -y r-base

# R 패키지 설치를 위한 BiocManager 설치
!R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager')"

# DESeq2 및 기타 필요한 R 패키지 설치
!R -e "BiocManager::install(c('DESeq2', 'ggplot2', 'pheatmap', 'RColorBrewer'))"

위의 코드를 실행시켜, 필요한 패키치를 설치하자. 시간이 꽤 걸린다.

!featureCounts -T 4 -p -t CDS -g transcript_id -a /content/drive/MyDrive/B_plantarii/gff/combined_annotations.gtf -o /content/drive/MyDrive/B_plantarii/counts/count_matrix.txt \
  /content/drive/MyDrive/B_plantarii/output/BM-1_output_sorted.bam \
  /content/drive/MyDrive/B_plantarii/output/BM-2_output_sorted.bam \
  /content/drive/MyDrive/B_plantarii/output/BM-3_output_sorted.bam \
  /content/drive/MyDrive/B_plantarii/output/sample1w_sorted.bam \
  /content/drive/MyDrive/B_plantarii/output/sample2w_sorted.bam \
  /content/drive/MyDrive/B_plantarii/output/sample3w_sorted.bam

설치가 끝났다면, featureCounts를 실행해, 각 유전자에 몇 개의 read들이 붙었는지 파일로 만들자.

이때 옵션으로 -p를 준다. 전사체 데이터가 paired end로 sequencing 되었기 때문에 넣어줘야 한다. 또한, -t CDS 옵션을 줘야 한다. -t gene_id로 하면 제대로 counting이 되지 않았다.

%load_ext rpy2.ipython
%%R
# 필요한 라이브러리 불러오기
library(DESeq2)

# count 데이터 불러오기
countData <- read.table("/content/drive/MyDrive/B_plantarii/counts/count_matrix.txt", header=TRUE, row.names=1)
gene_id <- rownames(countData)
countData <- countData[, -c(1:5)]  # 비샘플 열 제거 (Chr, Start, End, Strand, Length)

# 샘플 이름과 일치하도록 열 이름 수정
colnames(countData) <- c("BM-1", "BM-2", "BM-3", "BW-1", "BW-2", "BW-3")

# 샘플 정보 정의
sampleNames <- c("BM-1", "BM-2", "BM-3", "BW-1", "BW-2", "BW-3")
conditions <- factor(c("mutant", "mutant", "mutant", "wild_type", "wild_type", "wild_type"))
colData <- data.frame(row.names=sampleNames, condition=conditions)

# DESeq2 데이터셋 생성 및 분석 수행
dds <- DESeqDataSetFromMatrix(countData=countData, colData=colData, design=~condition)
dds <- DESeq(dds)

# 결과 추출 및 저장
res <- results(dds)
write.csv(as.data.frame(res), "/content/drive/MyDrive/B_plantarii/counts/DEG_results.csv")

위의 코드를 실행하면, R환경에서 DEG를 찾을 것이고, 결과를 파일로 만들 것이다.

%%R
# DESeq2 결과 불러오기
deg_results <- read.csv("/content/drive/MyDrive/B_plantarii/counts/DEG_results.csv", row.names = 1)

# p값이 0.05 이하인 DEG 필터링
deg_significant <- subset(deg_results, padj < 0.05)

# 새로운 CSV 파일로 저장
write.csv(deg_significant, "/content/drive/MyDrive/B_plantarii/counts/DEG_results_significant.csv")

유의 수준 0.05 이하인 DEG만 다시 선별해 준다.

4. DEG파일에 gene_function 추가

만들어진 DEG 파일에 gene_function이 없어, 추가하고 싶었다.

import pandas as pd
gff_path = "/content/drive/MyDrive/B_plantarii/gff/combined_annotations.gff3"
deg_path = "/content/drive/MyDrive/B_plantarii/counts/DEG_results_significant.csv"
output_path = "/content/drive/MyDrive/B_plantarii/counts/DEG_results_with_function.csv"

gff, deg, output 경로를 각각 설정한다.

deg_df = pd.read_csv(deg_path)
deg_df = deg_df.rename(columns={"Unnamed: 0": "gene_id"})
gene_info = []
current_gene_id = None
with open(gff_path,"r") as f:
  for line in f:
    if line.startswith("#") or line.strip() == "":
      continue
    ## tab 문자로 나누기
    columns = line.strip().split('\t')
    if columns[2] == "gene":
      attributes = columns[8]
      current_gene_id = None
      for attribute in attributes.split(';'):
        if attribute.startswith("ID="):
          current_gene_id = attribute.split('=')[1]

    elif columns[2] == "CDS":
      attributes = columns[8]
      product = None
      for attribute in attributes.split(';'):
        if attribute.startswith("product="):
          product = attribute.split('=')[1]
          if product:
            gene_info.append((current_gene_id, product))
          current_gene_id = None

gff_df = pd.DataFrame(gene_info, columns = ["gene_id","function"])
deg_df = pd.read_csv(deg_path)
deg_df = deg_df.rename(columns={"Unnamed: 0": "gene_id"})
merged_df = pd.merge(deg_df,gff_df, on = "gene_id", how = "left")

merged_df.to_csv(output_path, index = False)

deg 파일에 0번째 column의 id가 없어서, rename 메서드로 gene_id를 부여했다.

이후, deg 파일에 있는 유전자만 기능을 추가해 주었다.

5.MA plot과 Bar plot

Bar Plot

import pandas as pd 
import matplotlib.pyplot as plt 
deg_df = pd.read_csv("/content/drive/MyDrive/B_plantarii/counts/DEG_results_with_function.csv", index_col = "gene_id")
top_10 = deg_df.nlargest(10,"log2FoldChange")
bottom_10 = deg_df.nsmallest(10,"log2FoldChange")
top_bottom_df = pd.concat([top_10,bottom_10]) 

plt.figure(figsize = (12,8)) 
plt.bar(top_bottom_df.index, top_bottom_df["log2FoldChange"]) 
plt.xticks(rotation = 90) 
plt.xlabel("Gene ID") 
plt.ylabel("log2FoldChange") 
plt.title("Top 10 and bottom 10 Genes by log2FoldChange") 
plt.show()

output :

log2FoldChange 상위,하위 10개씩을 기준으로 생성한 Bar_plot

MA Plot

plt.figure(figsize=(10, 8))
plt.scatter(deg_df['baseMean'], deg_df['log2FoldChange'], alpha=0.5)
plt.xscale('log')
plt.xlabel('Base Mean (log scale)')
plt.ylabel('log2FoldChange')
plt.title('MA Plot')
plt.show()

output :