barcode

Biopython으로 시퀀스 가져오기

Coding/Python

참고로 다른 코딩글은 여기다가 안 올린다. 코드블럭도 없고 카테고리도 없어서 어지간한 코딩 이야기는 노션과 워드프레스에 올리는 중... 아니면 미디움이나. 근데 이건 우째도 올렸네? 아 생물학 카테고리가 있잖아요. 

Biopython은 생물정보학에서 쓰는건데 마침 심심하던 차에 잘됐다 써봐야징 해서 어제 깔았다. 리눅스의 경우 pip install bio(안되면 biopython)으로 걍 터미널에서 깔면 된다(우분투 20.02 LTS 기준). 윈도우마냥 pip 있는 데 안 찾아가도 됨 ㄹㅇ 편함. 근데 파이참 바로가기 없는 건 너무한 거 아니냐고... 내가 터미널 열고 직접 모시러 가야것냐... 

*Notes: Biopython을 사용하려면 사용할 기능에 맞는 모듈을 모셔와야 한다. 코드에 그게 생략되는 경우가 있는데(부분삽입의 경우) from ~ import ~가 생략된 것이다. 


일단 오늘의 시범조교를 모셔보자. 

ATGGTAGCTCAAGCTCCTCATGATGATCATCAGGATGATGAGAAATTAGCAGCAGCGAGACAAAAAGAGATCGAAGATTGGTTACCAATTACTTCATCAAGAAATGCAAAGTGGTGGTACTCTGCTTTTCACAATGTCACCGCCATGGTCGGTGCCGGAGTTCTTGGACTCCCTTACGCCATGTCTCAGCTCGGATGGGGACCGGGAATTGCAGTGTTGGTTTTGTCATGGGTCATAACACTATACACATTATGGCAAATGGTGGAAATGCATGAAATGGTTCCTGGAAAGCGTTTTGATCGTTACCATGAGCTCGGACAACACGCGTTTGGAGAAAAACTCGGTCTTTATATCGTTGTGCCGCAACAATTGATCGTTGAAATCGGTGTTTGCATCGTTTATATGGTCACTGGAGGCAAATCTTTAAAGAAATTTCATGAGCTTGTTTGTGATGATTGTAAACCAATCAAGCTTACTTATTTCATCATGATCTTTGCTTCGGTTCACTTCGTCCTCTCTCATCTTCCTAATTTCAATTCCATCTCCGGCGTTTCTCTTGCTGCTGCCGTTATGTCTCTCAGCTACTCAACAATCGCATGGGCATCATCAGCAAGCAAAGGTGTTCAAGAAGACGTTCAATACGGTTACAAAGCGAAAACAACAGCCGGTACGGTTTTCAATTTCTTCAGCGGTTTAGGTGATGTGGCATTTGCTTACGCGGGTCATAATGTGGTCCTTGAGATCCAAGCAACTATCCCTTCAACGCCTGAGAAACCCTCAAAAGGTCCCATGTGGAGAGGAGTCATCGTTGCTTACATCGTCGTAGCGCTCTGTTATTTCCCCGTGGCTCTCGTTGGATACTACATTTTCGGGAACGGAGTCGAAGATAATATTCTCATGTCACTTAAGAAACCGGCGTGGTTAATCGCCACGGCGAACATCTTCGTCGTGATCCATGTCATTGGTAGTTACCAGATATATGCAATGCCGGTATTTGATATGATGGAGACTTTATTGGTCAAGAAGCTAAATTTCAGACCAACCACAACTCTTCGGTTCTTTGTCCGTAATTTCTACGTTGCTGCAACTATGTTTGTTGGTATGACGTTTCCGTTCTTCGGTGGGCTTTTGGCGTTCTTTGGTGGATTCGCGTTTGCCCCAACCACATACTTCCTCCCTTGCGTCATTTGGTTAGCCATCTACAAACCCAAGAAATACAGCTTGTCTTGGTGGGCCAACTGGGTATGCATCGTGTTTGGTCTTTTCTTGATGGTCCTATCGCCAATTGGAGGGCTAAGGACAATCGTTATTCAAGCAAAAGGATACAAGTTTTACTCATAA

At5g40780의 CDS 시퀀스이다. (코드블럭은 있는데 wrap이 안돼서 걍 생으로 붙였음)

이걸 시퀀스 오브젝트로 생성하는 방법도 간단하다. 그냥 모듈 모셔온 다음 Seq=("")로 감싸버리면 된다. 

at5g40780=Seq("ATGGTAGCTCAAGCTCCTCATGATGATCATCAGGATGATGAGAAATTAGCAGCAGCGAGACAAAAAGAGATCGAAGATTGGTTACCAATTACTTCATCAAGAAATGCAAAGTGGTGGTACTCTGCTTTTCACAATGTCACCGCCATGGTCGGTGCCGGAGTTCTTGGACTCCCTTACGCCATGTCTCAGCTCGGATGGGGACCGGGAATTGCAGTGTTGGTTTTGTCATGGGTCATAACACTATACACATTATGGCAAATGGTGGAAATGCATGAAATGGTTCCTGGAAAGCGTTTTGATCGTTACCATGAGCTCGGACAACACGCGTTTGGAGAAAAACTCGGTCTTTATATCGTTGTGCCGCAACAATTGATCGTTGAAATCGGTGTTTGCATCGTTTATATGGTCACTGGAGGCAAATCTTTAAAGAAATTTCATGAGCTTGTTTGTGATGATTGTAAACCAATCAAGCTTACTTATTTCATCATGATCTTTGCTTCGGTTCACTTCGTCCTCTCTCATCTTCCTAATTTCAATTCCATCTCCGGCGTTTCTCTTGCTGCTGCCGTTATGTCTCTCAGCTACTCAACAATCGCATGGGCATCATCAGCAAGCAAAGGTGTTCAAGAAGACGTTCAATACGGTTACAAAGCGAAAACAACAGCCGGTACGGTTTTCAATTTCTTCAGCGGTTTAGGTGATGTGGCATTTGCTTACGCGGGTCATAATGTGGTCCTTGAGATCCAAGCAACTATCCCTTCAACGCCTGAGAAACCCTCAAAAGGTCCCATGTGGAGAGGAGTCATCGTTGCTTACATCGTCGTAGCGCTCTGTTATTTCCCCGTGGCTCTCGTTGGATACTACATTTTCGGGAACGGAGTCGAAGATAATATTCTCATGTCACTTAAGAAACCGGCGTGGTTAATCGCCACGGCGAACATCTTCGTCGTGATCCATGTCATTGGTAGTTACCAGATATATGCAATGCCGGTATTTGATATGATGGAGACTTTATTGGTCAAGAAGCTAAATTTCAGACCAACCACAACTCTTCGGTTCTTTGTCCGTAATTTCTACGTTGCTGCAACTATGTTTGTTGGTATGACGTTTCCGTTCTTCGGTGGGCTTTTGGCGTTCTTTGGTGGATTCGCGTTTGCCCCAACCACATACTTCCTCCCTTGCGTCATTTGGTTAGCCATCTACAAACCCAAGAAATACAGCTTGTCTTGGTGGGCCAACTGGGTATGCATCGTGTTTGGTCTTTTCTTGATGGTCCTATCGCCAATTGGAGGGCTAAGGACAATCGTTATTCAAGCAAAAGGATACAAGTTTTACTCATAA")
print(at5g40780)

이런 식으로. 

print(type(at5g40780))
<class 'Bio.Seq.Seq'>

누가봐도 시퀀스다. 

엥 근데 저거 걍 텍스트랑 뭔 차이임? Biopython을 통해 시퀀스로 생성한 데이터는 complement sequence와 reverse complement sequence를 만들 수 있다. 즉, 시퀀스로 불러온 CDS에 상보적인 시퀀스를 명령어 한 방에 만들 수 있는 것이다. 텍스트는 저런거 안된다. 

from Bio.Seq import Seq
at5g40780=Seq("ATGGTAGCTCAAGCTCCTCATGATGATCATCAGGATGATGAGAAATTAGCAGCAGCGAGACAAAAAGAGATCGAAGATTGGTTACCAATTACTTCATCAAGAAATGCAAAGTGGT")
print("Sequence",at5g40780)
print("Complement sequence",at5g40780.complement())
print("Reverse complement sequence",at5g40780.reverse_complement())

이렇게 입력하면 

Sequence ATGGTAGCTCAAGCTCCTCATGATGATCATCAGGATGATGAGAAATTAGCAGCAGCGAGACAAAAAGAGATCGAAGATTGGTTACCAATTACTTCATCAAGAAATGCAAAGTGGT
Complement sequence TACCATCGAGTTCGAGGAGTACTACTAGTAGTCCTACTACTCTTTAATCGTCGTCGCTCTGTTTTTCTCTAGCTTCTAACCAATGGTTAATGAAGTAGTTCTTTACGTTTCACCA
Reverse complement sequence ACCACTTTGCATTTCTTGATGAAGTAATTGGTAACCAATCTTCGATCTCTTTTTGTCTCGCTGCTGCTAATTTCTCATCATCCTGATGATCATCATGAGGAGCTTGAGCTACCAT

짜자잔 

complement는 CDS 시퀀스에 그냥 상보적인거고, reverse complement는 그 상보적인 시퀀스 순서를 뒤집은 모양이다. 


아니 근데 DNA가 겁나 바이트가 길어요... 저 시범조교 CDS도 1300자가 넘는데 어느 세월에 그거 복붙할거임... 그래서 FASTA와 Genbank의 데이터를 직접 열어볼 것이다. 

FASTA 시범조교는 PDB ID 2B10, cytochrome c peroxiase이다. 

>2B10_1|Chains A, C|Cytochrome c peroxidase, mitochondrial|Saccharomyces cerevisiae (4932)
TTPLVHVASVEKGRSYEDFQKVYNAIALKLREDDEYDNYIGYGPVLVRLAWHTSGTWDKHDNTGGSYGGTYRFKKEFNDPSNAGLQNGFKFLEPIHKEFPWISSGDLFSLGGVTAVQEMQGPKIPWRCGRVDTPEDTTPDNGRLPDADKDADYVRTFFQRLNMNDREVVALMGAHALGKTHLKNSGYEGPWGAANNVFTNEFYLNLLNEDWKLEKNDANNEQWDSKSGYMMLPTDYSLIQDPKYLSIVKEYANDQDKFFKDFSKAFEKLLENGITFPKDAPSPFIFKTLEEQGL
>2B10_2|Chains B, D|Cytochrome c iso-1|Saccharomyces cerevisiae (4932)
TEFKAGSAKKGATLFKTRCLQCHTVEKGGPHKVGPNLHGIFGRHSGQAEGYSYTDANIKKNVLWDENNMSEYLTNPKKYIPGTKMASGGLKKEKDRNDLITYLKKACE

FASTA 파일 형식은 이렇게 생겼다. 일단 저 파일을 어떻게 가져올거냐... PDB 가서 2B10으로 검색하고 FASTA 파일을 받아보자. 

for seq_record in SeqIO.parse("/home/koreanraichu/rcsb_pdb_2B10.fasta","fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))
2B10_1|Chains
Seq('TTPLVHVASVEKGRSYEDFQKVYNAIALKLREDDEYDNYIGYGPVLVRLAWHTS...QGL')
294
2B10_2|Chains
Seq('TEFKAGSAKKGATLFKTRCLQCHTVEKGGPHKVGPNLHGIFGRHSGQAEGYSYT...ACE')
108

FASTA 파일은 이런 식으로 가져올 수 있다. (seqIO 데려오면 된다) 예? 경로요? 아니 이거 리눅슨디? 

사실 분량때문에 넣지는 못하지만 

print(seq_record)

만 치면 전체 파일을 볼 수 있다. 처음 하시는 분들은 전체를 보고 저 예제를 따라해보는 걸 추천한다. 


for seq_record in SeqIO.parse("/home/koreanraichu/sequence.gb","genbank"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

Genbank에서 뭐 검색하고 그거 gk(gbk)파일로 받으면 그게 genbank 파일이다. 이것도 일단 다운받아서 불러와야 한다. 웹에 있는거요? 그거 바로 못갖고온다. 

아무튼 저 코드를 실행하면 

NM_000517.6
Seq('ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGC...GCA')
576

이렇게 된다. 그리고 얘도 FASTA와 마찬가지로 

print(seq_record)

를 주면 전체 파일을 볼 수는 있는데 분량이 어유...