2024. 11. 16. 21:59ㆍData_Analysis
1. Building a dataset of ferritin expressed in Lactiplantibacillus plantarum
I opened the NCBI Protein site and searched for "Ferritin Lactiplantibacillus plantarum".
The search returned 43 results, and I collected all of them into a single file, ferritinL.plantarum.raw.fasta.
>ALO75854.1 ferritin (plasmid) [Lactiplantibacillus plantarum]
MKYTKTKAVLNQLVADLSQMSMIIHQTHWYMRGPNFLKLHPLMDEFMEEIDSQLDVISERLIALDGSPYS
TLKEMAENTKIQDWPGEWDKTTPERLAHLVDGYRYLEDLYQHGIEVSDVEKDFSTQDIFIGLKTAIEKKI
WMIQAELGSAPEVDE
>TEA94412.1 ferritin [Lactiplantibacillus plantarum]
MKYTKTKAVLNQLVADLSQMSMIIHQTHWYMRGPNFLKLHPLMDEFMEEIDSQLDVISERLIALDGSPYS
TLKEMAENTKIQDWPGEWDKTTPERLAHLVDGYRCLEDLYQHGIEVSDVEKDFSTQDIFIGLKTAIEKKI
WMIQAELGSAPEIDE
>TEA92010.1 ferritin [Lactiplantibacillus plantarum]
MKYTKTKAVLNQLVADLSQMSMIIHQTHWYMRGPNFLKLHPLMDEFMEEIDSQLDVISERLIALDGSPYS
TLKEMAENTKIQDWPGEWDKTTPERLAHLVDGYRYLEDLYQHGIEVSDVEKDFSTQDIFIGLKTAIEKKI
WMIQAELGSAPEIDE
>TEA91991.1 ferritin, partial [Lactiplantibacillus plantarum]
MKYTKTKAVLNQLVADLSQMSMIIHQTHWYMRGPNFLKLH
>PHM02949.1 ferritin (plasmid) [Lactiplantibacillus plantarum]
MKYTKTKAVLNQLVADLSQMSMIIHQTHWYMRGPNFLKLHPLMDEFMEEIDSQLDVISERLIALDGSPYS
TLKEMAENTKIQDWPGEWDKTTPERLAHLVDGYRYLEDLYQHGIEVSDVEKDFSTQDIFIGLKTAIEKKI
WMIQAELGSAPEIDE
>BBA81365.1 ribonucleoside-diphosphate reductase, beta chain [Lactiplantibacillus plantarum]
MATDLAYYQKLLSNGNYKAINWDRVSDAIDKSTWEKLTEQFWLDTRIPVSNDMADWRELDDDHRWVVGHV
FGGLTLLDTLQSQDGLQALRRNVLTSHETAVLNNIQFMESVHAKSYSTIFETLNTPDEINEIFDWSDSEE
FLQAKAQWIYKLYDNIDEDPLKQKVANVFLETFLFYSGFYTPLYYLGHNQLPNVAEIIKLILRDESVHGT
YIGYKFQLGFKDRSEKQQAEFKDWMFDFLYKLYENEENYIHLVYDQIGWSDEVLTFSRYNANKALMNLGQ
DALFPDTAEDVNPVVMNGISTGTSNHDFFSQVGNGYRLGQVEAMQDTDYDIGNPDD
>CDN28226.1 hypothetical protein LP80_1530 [Lactiplantibacillus plantarum]
MSELTIDEQYAAELKQSDIDHHVPTAGAMTNHILSNLMVAYVKLTQVKWYVKGPQSLALRTAYQRLLDQN
VRQFAELGELLLDENQKPSSTTAELTKYSMLEENGAFKYQSADELVAATIKDFDTENLFVDRAIKLAEKE
TRPALAAWLVAYRGSNNRNIRELQVYLGNDARTGLDEEDEDDD
The records above are an excerpt of the file. A closer look shows that it also contains proteins other than ferritin.
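One quick way to see which non-ferritin proteins slipped into the search results is to tally the protein names in the FASTA headers. The snippet below is a minimal sketch using only the standard library. It assumes the NCBI header layout shown above (">ACCESSION description [organism]"); the function name tally_descriptions is made up for illustration, and the sample sequences are shortened.

```python
import re
from collections import Counter

def tally_descriptions(fasta_text):
    """Count protein names in FASTA headers, ignoring accession and organism."""
    counts = Counter()
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # sequence line, not a header
        # Assumed header layout: ">ACCESSION description [organism]"
        _, _, description = line[1:].partition(" ")
        # Drop the trailing "[organism]" tag.
        name = re.sub(r"\s*\[[^\]]*\]\s*$", "", description)
        counts[name] += 1
    return counts

# Headers copied from the excerpt above; sequences shortened for the example.
sample = """>ALO75854.1 ferritin (plasmid) [Lactiplantibacillus plantarum]
MKYTKTKAV
>TEA94412.1 ferritin [Lactiplantibacillus plantarum]
MKYTKTKAV
>BBA81365.1 ribonucleoside-diphosphate reductase, beta chain [Lactiplantibacillus plantarum]
MATDLAYYQ
>CDN28226.1 hypothetical protein LP80_1530 [Lactiplantibacillus plantarum]
MSELTIDEQ
"""

counts = tally_descriptions(sample)
for name, n in counts.most_common():
    print(n, name)
```

Running the same function on the full raw file would list every distinct protein name with its count, which makes it easier to decide what the filtering step should remove.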
2. Data preprocessing
I removed duplicate sequences, partial sequences, sequences containing "X", and sequences that are not ferritin.
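from Bio import SeqIO

raw_path = "/home/rudlab/projects/ferritinL.plantarumAn/data/ferritinL.plantarum.raw.fasta"
out_path = "/home/rudlab/projects/ferritinL.plantarumAn/data/ferritinL.plantarum.fasta"

seq_set = set()  # sequences already written, used to drop duplicates
with open(out_path, "w") as handle:
    for s in SeqIO.parse(raw_path, "fasta"):
        residues = str(s.seq)  # plain str, safe to hash and search
        # Keep only entries whose header mentions "ferritin"
        # (case-sensitive substring, so "ferritin-like" also passes).
        if "ferritin" not in s.description:
            continue
        # Skip sequences containing the unknown-residue code "X".
        if "X" in residues:
            continue
        # Skip incomplete entries.
        if "partial" in s.description or "truncated" in s.description:
            continue
        # Write each distinct sequence only once.
        if residues not in seq_set:
            seq_set.add(residues)
            SeqIO.write(s, handle, "fasta")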
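To make the effect of the four rules concrete, here is a small sketch that applies the same checks to a handful of in-memory records without Biopython. The helper keep_record and all six records are hypothetical and exist only for illustration. Note that the substring test on "ferritin" also accepts a header such as "DNA-binding ferritin-like protein".

```python
def keep_record(description, sequence, seen):
    """Return True if a record passes the same four rules as the filter above."""
    if "ferritin" not in description:
        return False  # header does not mention ferritin
    if "X" in sequence:
        return False  # contains an unknown residue
    if "partial" in description or "truncated" in description:
        return False  # incomplete entry
    if sequence in seen:
        return False  # duplicate of a sequence already kept
    seen.add(sequence)
    return True

# Hypothetical records: (header, sequence)
records = [
    ("A1 ferritin [L. plantarum]", "MKYTK"),
    ("A2 ferritin [L. plantarum]", "MKYTK"),  # duplicate sequence
    ("A3 ferritin, partial [L. plantarum]", "MKY"),
    ("A4 DNA-binding ferritin-like protein [L. plantarum]", "MSELT"),
    ("A5 ribonucleoside-diphosphate reductase [L. plantarum]", "MATDL"),
    ("A6 ferritin [L. plantarum]", "MKXTK"),  # contains X
]

seen = set()
kept = [desc.split()[0] for desc, seq in records if keep_record(desc, seq, seen)]
print(kept)  # ['A1', 'A4']
```

If only proteins annotated exactly as ferritin are wanted, the first rule would need a stricter test than a substring match.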
3. Multiple sequence alignment (MSA)
https://www.ebi.ac.uk/jdispatcher/msa/muscle?stype=protein
I ran the MSA with MUSCLE, using the web tool hosted at EMBL-EBI (link above).
The processed FASTA file is given as the input.
Output:
CLUSTAL multiple sequence alignment by MUSCLE (3.8)
EQM54203.1 -------MKYT-------------KTKEVLNQLVADLSQMSMIIHQTHWYMRGPNFLKLH
TEA94412.1 -------MKYT-------------KTKAVLNQLVADLSQMSMIIHQTHWYMRGPNFLKLH
ALO75854.1 -------MKYT-------------KTKAVLNQLVADLSQMSMIIHQTHWYMRGPNFLKLH
KLD56812.1 -------MKYT-------------KTKAVLNQLVADLSQMSMIIHQTHWYMRGPNFLKLH
TEA92010.1 -------MKYT-------------KTKAVLNQLVADLSQMSMIIHQTHWYMRGPNFLKLH
BBM21960.1 MSELTIDEQYAAELKQSDIDHHVPTAGAMTNHILSNLMVAYVKLSQVKWYVKGPQSLALR
EPD22853.1 MSELTIDEQYAAELKQSDIDHHVPTAGAMTNHILSNLMVAYVKLTQVKWYVKGPQSLALR
:*: .: : *:::::* : : *.:**:.**: * *.
EQM54203.1 PLMDEFMEEIDSQLDVISERLIALDGNPYSTLKEMADNTKIKDWPGTWDKTTPERLAHLV
TEA94412.1 PLMDEFMEEIDSQLDVISERLIALDGSPYSTLKEMAENTKIQDWPGEWDKTTPERLAHLV
ALO75854.1 PLMDEFMEEIDSQLDVISERLIALDGSPYSTLKEMAENTKIQDWPGEWDKTTPERLAHLV
KLD56812.1 PLMDEFMEEIDSQLDVISERLIALDGSPYSTLKEMVENTKIQDWPGEWDKTTPERLAHLV
TEA92010.1 PLMDEFMEEIDSQLDVISERLIALDGSPYSTLKEMAENTKIQDWPGEWDKTTPERLAHLV
BBM21960.1 TEYQQLIDQNVRQFAELGDLLLDENQKPSSTTAELTKYSMLEENGAFKYQSADELVAATI
EPD22853.1 TAYQRLLDQNVRQFAELGELLLDENQKPSSTTAELTKYSMLEENGAFKYQSVDELVAATI
. : :::: *: :.: *: : .* ** *:.. : :::. . . ::. * :* :
EQM54203.1 DGYRYLEDLYQHGIEVSDVEKDFSTQDIFIGLKTAIEKKIWMIQAELGSAPEI-------
TEA94412.1 DGYRCLEDLYQHGIEVSDVEKDFSTQDIFIGLKTAIEKKIWMIQAELGSAPEI-------
ALO75854.1 DGYRYLEDLYQHGIEVSDVEKDFSTQDIFIGLKTAIEKKIWMIQAELGSAPEV-------
KLD56812.1 DGYRYLEDLYQHGIEVSDVEKDFSTQDIFIGLKTAIEKKIWMIQAELGSAPEI-------
TEA92010.1 DGYRYLEDLYQHGIEVSDVEKDFSTQDIFIGLKTAIEKKIWMIQAELGSAPEI-------
BBM21960.1 KDFDTENLFVDRAIKLAEKETRPALAAWLVAYRGSNNRNIRELQAYLGNDARTGLDEEDE
EPD22853.1 KDFDTENLFVDRAIKLAEKENRPALAAWLVAYRGSNNRNIRELQAYLGNDARTGLDEEDE
..: : : :..*:::: *. : ::. . : :.:*. :** **. .
EQM54203.1 --DE
TEA94412.1 --DE
ALO75854.1 --DE
KLD56812.1 --DE
TEA92010.1 --DE
BBM21960.1 DDDD
EPD22853.1 DDDD
*:
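The alignment suggests two groups: five nearly identical ferritin sequences and two longer sequences that differ from them at many positions. A rough way to put a number on this is pairwise percent identity. The sketch below parses CLUSTAL-style text with the standard library and compares sequences over the columns where neither has a gap. The function names are made up for illustration, the identity definition is one of several in common use, and the sample contains only the last two blocks of three of the sequences above.

```python
import re
from itertools import combinations

def parse_clustal(text):
    """Join the per-block rows of a CLUSTAL alignment into one string per ID."""
    rows = {}
    for line in text.splitlines():
        if line.startswith("CLUSTAL"):
            continue  # format header
        m = re.match(r"^(\S+)\s+([A-Za-z-]+)$", line.rstrip())
        if m:  # consensus rows (*, :, .) and blank lines do not match
            rows[m.group(1)] = rows.get(m.group(1), "") + m.group(2)
    return rows

def percent_identity(a, b):
    """Identical residues divided by columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

# Last two alignment blocks for three of the sequences shown above.
clustal_text = """CLUSTAL multiple sequence alignment by MUSCLE (3.8)

TEA94412.1 DGYRCLEDLYQHGIEVSDVEKDFSTQDIFIGLKTAIEKKIWMIQAELGSAPEI-------
ALO75854.1 DGYRYLEDLYQHGIEVSDVEKDFSTQDIFIGLKTAIEKKIWMIQAELGSAPEV-------
BBM21960.1 KDFDTENLFVDRAIKLAEKETRPALAAWLVAYRGSNNRNIRELQAYLGNDARTGLDEEDE

TEA94412.1 --DE
ALO75854.1 --DE
BBM21960.1 DDDD
"""

aln = parse_clustal(clustal_text)
for id_a, id_b in combinations(aln, 2):
    print(id_a, id_b, round(percent_identity(aln[id_a], aln[id_b]), 1))
```

Applied to the full alignment, the same loop would give an identity value for every pair of sequences.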
4. Drawing a WebLogo
I drew the sequence logo with WebLogo (weblogo.berkeley.edu).
Output: [WebLogo sequence logo image]
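The Ferritin-like domain protein, DNA-binding ferritin-like protein, DPS family, and ferritin sequences differ in length, so the aligner has to pad the shorter ones with gaps. This is the likely reason for the empty regions in the logo plot.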
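This guess can be checked against the alignment by measuring, for each column, what fraction of the sequences have a gap. Columns where most sequences are gapped contain few residues and are the natural candidates for the empty regions. The following is a minimal sketch: the function names and the 0.5 threshold are arbitrary choices for illustration, and the three rows are the first 28 columns of two ferritin rows and one longer row from the alignment above.

```python
def gap_fractions(rows):
    """Fraction of '-' characters in each alignment column."""
    n = len(rows)
    return [sum(ch == "-" for ch in col) / n for col in zip(*rows)]

def gappy_runs(fractions, threshold=0.5):
    """Return 1-based (start, end) ranges of consecutive columns above threshold."""
    runs, start = [], None
    for i, f in enumerate(fractions, start=1):
        if f > threshold and start is None:
            start = i  # a gappy run begins here
        elif f <= threshold and start is not None:
            runs.append((start, i - 1))  # run ended at the previous column
            start = None
    if start is not None:
        runs.append((start, len(fractions)))
    return runs

# First 28 alignment columns of TEA94412.1, ALO75854.1 and BBM21960.1.
rows = [
    "-------MKYT-------------KTKA",
    "-------MKYT-------------KTKA",
    "MSELTIDEQYAAELKQSDIDHHVPTAGA",
]

fractions = gap_fractions(rows)
print(gappy_runs(fractions))  # [(1, 7), (12, 24)]
```

With all seven aligned sequences the gap fraction in those columns would be 5/7 rather than 2/3, but the same two ranges would be reported.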