(19)

Biopython SeqIO 써보기

파이참 쓰다보니 키보드 먹통돼서 재부팅 여러번 했다. 이것도 갈 때 된 듯... *참고로 그냥 리드나 파싱으로 긁어오는 방법은 첫 글에서 썼으므로 생략한다. Iteration 활용하기 근데 들어도 이건 뭔 기능인가 싶긴 함. from Bio import SeqIO identifiers = [seq_record.id for seq_record in SeqIO.parse("/home/koreanraichu/sequence.gb", "genbank")] print(identifiers) 뭐야 wrap 넣어줘요... 레코드와 for문을 조합해서 원하는 데이터(ID나 시퀀스, description)만 가져올 수 있다. from Bio import SeqIO identifiers = [seq_record.id fo..

Biopython으로 시퀀스 레코드 생성하고 만져보기

오늘껀 근데 하면서 나도 이해 못했음... SeqRecord를 만들려면 뭘 또 모셔와야 하는데 바로 from Bio.SeqRecord import SeqRecord 이놈이다. 이정도면 복잡한 거 돌리려면 아주 모셔오는데만 열댓줄 쓰게 생겼는데??? 텐서플로우나 저기 그 넘파이 판다스처럼 합쳐버리지 그러냐... (걔네는 일단 한 번 데려오면 다 써먹을 수 있음) 레코드 생성하기 이모지 넣고싶은데 이거 리눅스라 이모지 넣는 법을 모름... 아무튼 레코드는 이렇게 생성하면 된다. from Bio.SeqRecord import SeqRecord from Bio.Seq import Seq EcoRI=Seq("GAATTC") # ECoRI sequence ECoRI_r=SeqRecord(EcoRI) #로 레코드를 ..

Biopython으로 시퀀스 다뤄보기

전에요? 아니 그건 가져온거고. 시퀀스도 텍스트나 리스트처럼 인덱싱과 슬라이싱이 된다. 방법도 똑같다. 그래서 그건 생략할 예정임... 그리고 그거 아니어도 분량이 많아요 또. .count() 특정 문자열이 몇 개인지 세 준다. 뭐 시퀀스에서 아데닌이나 구아닌 갯수 세주고 그런거다. 이걸 이용하면 특정 단백질 시퀀스에서 특정 아미노산(카테고리면 겁나 세야된다...)이 차지하는 비율도 볼 수 있다. 쉬어가는 코너-Primer에서 GC함량 구하기 아 이거 중요합니다. Primer 만들 때 중요한 요소 중 하나가 GC 함량이다. 참고로 여기서는 두 가지 방법으로 구해볼건데 첫번째는 .count()를 이용해 G와 C의 수를 세서 전체 DNA 염기 수로 나누는 거고, 두번째는 바이오파이썬의 모듈을 이용하는 방법이..

Biopython으로 시퀀스 가져오기

참고로 다른 코딩글은 여기다가 안 올린다. 코드블럭도 없고 카테고리도 없어서 어지간한 코딩 이야기는 노션과 워드프레스에 올리는 중... 아니면 미디움이나. 근데 이건 우째도 올렸네? 아 생물학 카테고리가 있잖아요. Biopython은 생물정보학에서 쓰는건데 마침 심심하던 차에 잘됐다 써봐야징 해서 어제 깔았다. 리눅스의 경우 pip install bio(안되면 biopython)으로 걍 터미널에서 깔면 된다(우분투 20.02 LTS 기준). 윈도우마냥 pip 있는 데 안 찾아가도 됨 ㄹㅇ 편함. 근데 파이참 바로가기 없는 건 너무한 거 아니냐고... 내가 터미널 열고 직접 모시러 가야것냐... *Notes: Biopython을 사용하려면 사용할 기능에 맞는 모듈을 모셔와야 한다. 코드에 그게 생략되는 경..

Biopython SeqIO 써보기

Coding/Python 2022. 8. 20. 02:43

파이참 쓰다보니 키보드 먹통돼서 재부팅 여러번 했다. 이것도 갈 때 된 듯... 


*참고로 그냥 리드나 파싱으로 긁어오는 방법은 첫 글에서 썼으므로 생략한다. 

 

Iteration 활용하기

근데 들어도 이건 뭔 기능인가 싶긴 함. 

 

from Bio import SeqIO
identifiers = [seq_record.id for seq_record in SeqIO.parse("/home/koreanraichu/sequence.gb", "genbank")]
print(identifiers)

뭐야 wrap 넣어줘요... 

레코드와 for문을 조합해서 원하는 데이터(ID나 시퀀스, description)만 가져올 수 있다. 

from Bio import SeqIO
identifiers = [seq_record.id for seq_record in SeqIO.parse("/home/koreanraichu/ls_orchid.gbk", "genbank")]
print(identifiers)
['Z78533.1', 'Z78532.1', 'Z78531.1', 'Z78530.1', 'Z78529.1', 'Z78527.1', 'Z78526.1', 'Z78525.1', 'Z78524.1', 'Z78523.1', 'Z78522.1', 'Z78521.1', 'Z78520.1', 'Z78519.1', 'Z78518.1', 'Z78517.1', 'Z78516.1', 'Z78515.1', 'Z78514.1', 'Z78513.1', 'Z78512.1', 'Z78511.1', 'Z78510.1', 'Z78509.1', 'Z78508.1', 'Z78507.1', 'Z78506.1', 'Z78505.1', 'Z78504.1', 'Z78503.1', 'Z78502.1', 'Z78501.1', 'Z78500.1', 'Z78499.1', 'Z78498.1', 'Z78497.1', 'Z78496.1', 'Z78495.1', 'Z78494.1', 'Z78493.1', 'Z78492.1', 'Z78491.1', 'Z78490.1', 'Z78489.1', 'Z78488.1', 'Z78487.1', 'Z78486.1', 'Z78485.1', 'Z78484.1', 'Z78483.1', 'Z78482.1', 'Z78481.1', 'Z78480.1', 'Z78479.1', 'Z78478.1', 'Z78477.1', 'Z78476.1', 'Z78475.1', 'Z78474.1', 'Z78473.1', 'Z78472.1', 'Z78471.1', 'Z78470.1', 'Z78469.1', 'Z78468.1', 'Z78467.1', 'Z78466.1', 'Z78465.1', 'Z78464.1', 'Z78463.1', 'Z78462.1', 'Z78461.1', 'Z78460.1', 'Z78459.1', 'Z78458.1', 'Z78457.1', 'Z78456.1', 'Z78455.1', 'Z78454.1', 'Z78453.1', 'Z78452.1', 'Z78451.1', 'Z78450.1', 'Z78449.1', 'Z78448.1', 'Z78447.1', 'Z78446.1', 'Z78445.1', 'Z78444.1', 'Z78443.1', 'Z78442.1', 'Z78441.1', 'Z78440.1', 'Z78439.1']

예제에 있단 파일로 돌렸더니 94개 나온다. 근데 94개가 저렇게 중구난방으로 나오니까 힘들다... 이거 넘파이 배열로 바꿔서 깔쌈하게 보여주면 안되나? 

from Bio import SeqIO
import numpy as np
identifiers = [seq_record.id for seq_record in SeqIO.parse("/home/koreanraichu/ls_orchid.gbk", "genbank")]
identifiers=np.array(identifiers)
identifiers=identifiers.reshape(2,47)
print(identifiers)

그래서 해봤다. 

[['Z78533.1' 'Z78532.1' 'Z78531.1' 'Z78530.1' 'Z78529.1' 'Z78527.1'
  'Z78526.1' 'Z78525.1' 'Z78524.1' 'Z78523.1' 'Z78522.1' 'Z78521.1'
  'Z78520.1' 'Z78519.1' 'Z78518.1' 'Z78517.1' 'Z78516.1' 'Z78515.1'
  'Z78514.1' 'Z78513.1' 'Z78512.1' 'Z78511.1' 'Z78510.1' 'Z78509.1'
  'Z78508.1' 'Z78507.1' 'Z78506.1' 'Z78505.1' 'Z78504.1' 'Z78503.1'
  'Z78502.1' 'Z78501.1' 'Z78500.1' 'Z78499.1' 'Z78498.1' 'Z78497.1'
  'Z78496.1' 'Z78495.1' 'Z78494.1' 'Z78493.1' 'Z78492.1' 'Z78491.1'
  'Z78490.1' 'Z78489.1' 'Z78488.1' 'Z78487.1' 'Z78486.1']
 ['Z78485.1' 'Z78484.1' 'Z78483.1' 'Z78482.1' 'Z78481.1' 'Z78480.1'
  'Z78479.1' 'Z78478.1' 'Z78477.1' 'Z78476.1' 'Z78475.1' 'Z78474.1'
  'Z78473.1' 'Z78472.1' 'Z78471.1' 'Z78470.1' 'Z78469.1' 'Z78468.1'
  'Z78467.1' 'Z78466.1' 'Z78465.1' 'Z78464.1' 'Z78463.1' 'Z78462.1'
  'Z78461.1' 'Z78460.1' 'Z78459.1' 'Z78458.1' 'Z78457.1' 'Z78456.1'
  'Z78455.1' 'Z78454.1' 'Z78453.1' 'Z78452.1' 'Z78451.1' 'Z78450.1'
  'Z78449.1' 'Z78448.1' 'Z78447.1' 'Z78446.1' 'Z78445.1' 'Z78444.1'
  'Z78443.1' 'Z78442.1' 'Z78441.1' 'Z78440.1' 'Z78439.1']]

참고로 numpy array는 5*6을 만들려면 진짜 데이터가 30개 있어야된다. 즉, 저게 최선이다. 

 

레코드 반복

이것도 Iteration을 이용한다. 

from Bio import SeqIO
record_iterator = SeqIO.parse("/home/koreanraichu/rcsb_pdb_2B10.fasta", "fasta")

first_record = next(record_iterator)
print(first_record.id)
print(first_record.description)

second_record = next(record_iterator)
print(second_record.id)
print(second_record.description)
from Bio import SeqIO
first_record_2 = next(SeqIO.parse("/home/koreanraichu/rcsb_pdb_2B10.fasta", "fasta"))
print(first_record_2)

위 코드와 아래 코드 둘 다 iteration을 활용하는 건 맞다. 

2B10_1|Chains
2B10_1|Chains A, C|Cytochrome c peroxidase, mitochondrial|Saccharomyces cerevisiae (4932)
2B10_2|Chains
2B10_2|Chains B, D|Cytochrome c iso-1|Saccharomyces cerevisiae (4932)
ID: 2B10_1|Chains
Name: 2B10_1|Chains
Description: 2B10_1|Chains A, C|Cytochrome c peroxidase, mitochondrial|Saccharomyces cerevisiae (4932)
Number of features: 0
Seq('TTPLVHVASVEKGRSYEDFQKVYNAIALKLREDDEYDNYIGYGPVLVRLAWHTS...QGL')

위 코드는 next 수가 두 개라 앞에서 두 데이터+원하는 부분만 띄운 거고 아래는 판다스의 head()마냥 맨 위에걸 띄운거다. 아니 그러면 next를 일일이 쳐야되나요? for문 안됨? 있어봐요 해봅시다. 

 

from Bio import SeqIO
record_iterator = SeqIO.parse("/home/koreanraichu/ls_orchid.fasta", "fasta")
for i in range(0,4):
    record = next(record_iterator)
    print(record.id)
    print(record.description)
gi|2765658|emb|Z78533.1|CIZ78533
gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
gi|2765657|emb|Z78532.1|CCZ78532
gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA
gi|2765656|emb|Z78531.1|CFZ78531
gi|2765656|emb|Z78531.1|CFZ78531 C.fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DNA
gi|2765655|emb|Z78530.1|CMZ78530
gi|2765655|emb|Z78530.1|CMZ78530 C.margaritaceum 5.8S rRNA gene and ITS1 and ITS2 DNA

되는데요? 

from Bio import SeqIO
record_iterator = SeqIO.parse("/home/koreanraichu/ls_orchid.fasta", "fasta")

i=0
while i<4
    record = next(record_iterator)
    print(record.id)
    print(record.description)
    i=i+1

참고로 While문도 된다. 

 

레코드 목록 뽑기

from Bio import SeqIO
records = list(SeqIO.parse("/home/koreanraichu/sequence.gb", "genbank"))
print(records)

이게 무옵션 기본 코드이다. 

print("First record:",records[0])
print("Last record:",records[-1])

인덱싱도 된다. 

 

first=records[0]
last=records[-1]
# 변수 안 주면 풀로 나온다...
print("First record ID:",first.id)
print("Last record ID:",last.id)

인덱싱 하면서 특정 인자만 보는 것도 된다. (위 코드는 처음과 맨 마지막 데이터의 아이디를 보여주는 코드)

 

데이터 추출

from Bio import SeqIO
record_iterator = SeqIO.parse("/home/koreanraichu/ls_orchid.gbk", "genbank")
first_record = next(record_iterator)
# 첫 번째 레코드를 가져온다.
print(first_record.annotations) # Annotations
print(first_record.annotations.keys()) # Key
print(first_record.annotations.values()) # Values
{'molecule_type': 'DNA', 'topology': 'linear', 'data_file_division': 'PLN', 'date': '30-NOV-2006', 'accessions': ['Z78533'], 'sequence_version': 1, 'gi': '2765658', 'keywords': ['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2'], 'source': 'Cypripedium irapeanum', 'organism': 'Cypripedium irapeanum', 'taxonomy': ['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliophyta', 'Liliopsida', 'Asparagales', 'Orchidaceae', 'Cypripedioideae', 'Cypripedium'], 'references': [Reference(title='Phylogenetics of the slipper orchids (Cypripedioideae: Orchidaceae): nuclear rDNA ITS sequences', ...), Reference(title='Direct Submission', ...)]} 
# 그냥 주석을 출력했을 때
dict_keys(['molecule_type', 'topology', 'data_file_division', 'date', 'accessions', 'sequence_version', 'gi', 'keywords', 'source', 'organism', 'taxonomy', 'references'])
# 키 주세요
dict_values(['DNA', 'linear', 'PLN', '30-NOV-2006', ['Z78533'], 1, '2765658', ['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2'], 'Cypripedium irapeanum', 'Cypripedium irapeanum', ['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliophyta', 'Liliopsida', 'Asparagales', 'Orchidaceae', 'Cypripedioideae', 'Cypripedium'], [Reference(title='Phylogenetics of the slipper orchids (Cypripedioideae: Orchidaceae): nuclear rDNA ITS sequences', ...), Reference(title='Direct Submission', ...)]])
# 값 주세요

Annotation은 저장 형태가 딕셔너리라 이렇게 나온다고... 딕셔너리 형태라는 건 Key:value로 저장되어 있기 때문에 키값이나 밸류값만 볼 수 있다는 얘기이기도 하다. 

 

print(first_record.annotations["source"]) # Annotation+selector(키 중 하나를 선택)
Cypripedium irapeanum # 결과

이런 식으로 키 중 하나를 선택해서 거기에 대한 밸류를 볼 수 있다. 

 

# 아래 코드는 리스트업 코드이다.
all_species = []
for seq_record in record_iterator:
    all_species.append(seq_record.annotations["organism"])
print(all_species)

숫자만 맞으면 어레이 만들어도 된다. 

all_species = [
    seq_record.annotations["organism"]
    for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank")
]
print(all_species)

for문에 던져버려도 되고. 

 

수정

from Bio import SeqIO
record_iterator = SeqIO.parse("/home/koreanraichu/rcsb_pdb_2B10_mod.fasta", "fasta") # 불러와서(사본 만듦)
first_record = next(record_iterator)
first_record.id="SC_2B10_1" # ID를 바꿨다.
print(first_record.id)

원본 파일에 손대기 좀 그래서 사본 만들어서 수정했다. 수정은 이런 식으로 하면 된다. 

 

웹에서 받아오기

아 솔직치 다운받기 귀찮은데 아이디 알면 걍 웹에서 뜯어오면 안돼요? 됨. 

 

파일 주세요 Genbank여... 

from Bio import Entrez
from Bio import SeqIO
Entrez.email = "blackholekun@gmail.com"
with Entrez.efetch(
    db="nucleotide", rettype="fasta", retmode="text", id="60391722"
) as handle:
    seq_record = SeqIO.read(handle, "fasta")
print(seq_record)

순순히 파일을 내놓는다면 유혈사태는 면할 것이다! (메일 내꺼임)

ID: AB018076.3
Name: AB018076.3
Description: AB018076.3 Homo sapiens hedgehog gene, complete cds
Number of features: 0
Seq('ATGTCTCCCGCCCGGCTCCGGCCCCGACTGCACTTCTGCCTGGTCCTGTTGCTG...ACC')

그래서 받아옴

 

from Bio import Entrez
from Bio import SeqIO
Entrez.email = "blackholekun@gmail.com"
with Entrez.efetch(
    db="nucleotide", rettype="gb", retmode="text", id="60391722"
) as handle:
    seq_record = SeqIO.read(handle, "gb")
print(seq_record)

파일 또 내놔! 

 

ID: AB018076.3
Name: AB018076
Description: Homo sapiens hedgehog gene, complete cds
Number of features: 8
/molecule_type=DNA
/topology=linear
/data_file_division=PRI
/date=24-FEB-2005
/accessions=['AB018076', 'AB010092', 'AB018075']
/sequence_version=3
/keywords=['']
/source=Homo sapiens (human)
/organism=Homo sapiens
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']
/references=[Reference(title='Expression of Sonic hedgehog and its receptor Patched/Smoothened in human cancer cell lines and embryonic organs', ...), Reference(title='Direct Submission', ...)]
/comment=On or before Mar 1, 2005 this sequence version replaced AB010092.1,
AB018075.1, AB018076.2.
Seq('ATGTCTCCCGCCCGGCTCCGGCCCCGACTGCACTTCTGCCTGGTCCTGTTGCTG...ACC')

아, 이건 genbank 포맷이다. 파스타랑 저거랑 다 된다. 단백질 이름이요? 아니 진짜 있는건데??? 

 

from Bio import Entrez
from Bio import SeqIO
Entrez.email = "blackholekun@gmail.com"
with Entrez.efetch(
    db="nucleotide", rettype="gb", retmode="text", id="60391722,60391706,1519245148" # 여러개도 된다.
) as handle:
    for seq_record in SeqIO.parse(handle, "gb"):
        print(seq_record.description)
Homo sapiens hedgehog gene, complete cds
Homo sapiens hedgehog gene, complete cds
Homo sapiens sonic hedgehog signaling molecule (SHH), transcript variant 1, mRNA

아 물론 뭉텅이로 받는것도 된다. 

 

파일 주세요 swissprot이여... 

솔직히 이정도면 PDB랑 유니프롯이랑 TAIR랑 펍켐도 지원해줘야된다. 개발자 일해라. 

 

from Bio import ExPASy
from Bio import SeqIO
with ExPASy.get_sprot_raw("O23729") as handle:
    seq_record = SeqIO.read(handle, "swiss")
print(seq_record)

스위스프롯에서는 이렇게 가져온다. 

ID: O23729
Name: CHS3_BROFI
Description: RecName: Full=Chalcone synthase 3; EC=2.3.1.74; AltName: Full=Naringenin-chalcone synthase 3;
Database cross-references: EMBL:AF007097, SMR:O23729, PRIDE:O23729, UniPathway:UPA00154, GO:GO:0016210, GO:GO:0009813, Gene3D:3.40.47.10, InterPro:IPR012328, InterPro:IPR001099, InterPro:IPR018088, InterPro:IPR011141, InterPro:IPR016039, PANTHER:PTHR11877, Pfam:PF02797, Pfam:PF00195, PIRSF:PIRSF000451, SUPFAM:SSF53901, PROSITE:PS00441
Number of features: 2
/molecule_type=protein
/accessions=['O23729']
/protein_existence=2
/date=15-JUL-1999
/sequence_version=1
/date_last_sequence_update=01-JAN-1998
/date_last_annotation_update=10-FEB-2021
/entry_version=75
/gene_name=Name=CHS3;
/organism=Bromheadia finlaysoniana (Orchid)
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliopsida', 'Liliopsida', 'Asparagales', 'Orchidaceae', 'Epidendroideae', 'Vandeae', 'Adrorhizinae', 'Bromheadia']
/ncbi_taxid=['41205']
/comment=FUNCTION: The primary product of this enzyme is 4,2',4',6'- tetrahydroxychalcone (also termed naringenin-chalcone or chalcone) which can under specific conditions spontaneously isomerize into naringenin.
CATALYTIC ACTIVITY: Reaction=4-coumaroyl-CoA + 2 H(+) + 3 malonyl-CoA = 2',4,4',6'-   tetrahydroxychalcone + 3 CO2 + 4 CoA; Xref=Rhea:RHEA:11128,   ChEBI:CHEBI:15378, ChEBI:CHEBI:16526, ChEBI:CHEBI:57287,   ChEBI:CHEBI:57355, ChEBI:CHEBI:57384, ChEBI:CHEBI:77645; EC=2.3.1.74;   Evidence={ECO:0000255|PROSITE-ProRule:PRU10023};
PATHWAY: Secondary metabolite biosynthesis; flavonoid biosynthesis.
SIMILARITY: Belongs to the thiolase-like superfamily. Chalcone/stilbene synthases family. {ECO:0000305}.
/references=[Reference(title='Molecular cloning and sequence analysis of chalcone synthase cDNAs of Bromheadia finlaysoniana.', ...)]
/keywords=['Acyltransferase', 'Flavonoid biosynthesis', 'Transferase']
Seq('MAPAMEEIRQAQRAEGPAAVLAIGTSTPPNALYQADYPDYYFRITKSEHLTELK...GAE')

뭔진 모르겠지만 일단 잘못했어요

참고로 저거 치니까 유니프롯(Uniprot) 나오길래 여기서 아이디 제공해주는건가 했는데 저기다가 유니프롯 아이디 넣으니까 됨. 

 

from Bio import ExPASy
from Bio import SeqIO
with ExPASy.get_sprot_raw("Q14005") as handle:
    seq_record = SeqIO.read(handle, "swiss")
print(seq_record.name)
print(seq_record.description)

Q14005: Interleukin-16

 

IL16_HUMAN
RecName: Full=Pro-interleukin-16; Contains: RecName: Full=Interleukin-16; Short=IL-16; AltName: Full=Lymphocyte chemoattractant factor; Short=LCF;

뭐야 왜 돼요 ㄷㄷ 

 

딕셔너리화하기

뭐 하면 뭐 좋은갑지... 모르것다... (뇌클리어)

 

SeqIO.to_dict

from Bio import SeqIO
orchid_dict = SeqIO.to_dict(SeqIO.parse("/home/koreanraichu/sequence.gb", "genbank"))
print(orchid_dict.keys())
print(orchid_dict.values())
dict_keys(['NM_000517.6'])
dict_values([SeqRecord(seq=Seq('ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGC...GCA'), id='NM_000517.6', name='NM_000517', description='Homo sapiens hemoglobin subunit alpha 2 (HBA2), mRNA', dbxrefs=[])])

이런 식으로 딕셔너리화할 수 있고 

 

print(orchid_dict['2B10_1|Chains'].description)
2B10_1|Chains A, C|Cytochrome c peroxidase, mitochondrial|Saccharomyces cerevisiae (4932)

이런 식으로 셀렉터 적용도 된다. 

 

for record in SeqIO.parse("/home/koreanraichu/rcsb_pdb_2B10.fasta", "fasta"):
    print(record.id, seguid(record.seq))

SEGUID checksum 함수를 적용시켜 딕셔너리화 할 수도 있다는데... 

2B10_1|Chains kSIi2w+CcP3y/k3bFRVhvvWtCtM
2B10_2|Chains SA4dH3bOSXcG43088Houu01TH2U

이건 키가 일단 너무 저세상이다. InChl key 못지 않음... 

seguid_dict = SeqIO.to_dict(SeqIO.parse("/home/koreanraichu/rcsb_pdb_2B10.fasta", "fasta"),lambda rec: seguid(rec.seq))
record = seguid_dict["kSIi2w+CcP3y/k3bFRVhvvWtCtM"]
print(record.id)
2B10_1|Chains

그러니까 저 함수 적용하고 저 키를 다 외우고 있으면 찾는 게 가능하다는 얘기. 

 

SeqIO.index

from Bio import SeqIO
orchid_dict = SeqIO.index("/home/koreanraichu/Octodon degus insulin.gb", "genbank")
print(orchid_dict.keys())
print(orchid_dict.values())
KeysView(SeqIO.index('/home/koreanraichu/Octodon degus insulin.gb', 'genbank', alphabet=None, key_function=None))
ValuesView(SeqIO.index('/home/koreanraichu/Octodon degus insulin.gb', 'genbank', alphabet=None, key_function=None))

예제 보면 리스트로 잘만 나오던데 나는 왜 저따구여? 

 

쓰기

언제까지 긁어만 올텐가... 

 

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

rec1 = SeqRecord(
    Seq(
        "MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD"
        "GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK"
        "NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM"
        "SSAC",
    ),
    id="gi|14150838|gb|AAK54648.1|AF376133_1",
    description="chalcone synthase [Cucumis sativus]",
)

rec2 = SeqRecord(
    Seq(
        "YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ"
        "DMVVVEIPKLGKEAAVKAIKEWGQ",
    ),
    id="gi|13919613|gb|AAK33142.1|",
    description="chalcone synthase [Fragaria vesca subsp. bracteata]",
)

rec3 = SeqRecord(
    Seq(
        "MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMC"
        "EKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQP"
        "KSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAEN"
        "NKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELV"
        "SAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFW"
        "IAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGT"
        "TGEGLEWGVLFGFGPGLTVETVVLHSVAT",
    ),
    id="gi|13925890|gb|AAK49457.1|",
    description="chalcone synthase [Nicotiana tabacum]",
)

my_records = [rec1, rec2, rec3]

from Bio import SeqIO
SeqIO.write(my_records, "/home/koreanraichu/example.fasta", "fasta")

아니 이거 내가 IL로 트라이했는데 에러났음... 예제껀 잘됐는데 뭔 타입에러가 반기던데? 

아무튼 이런 식으로 저장된다. (메시지 출력 따로 안했다)

 

from Bio import SeqIO

records = SeqIO.parse("/home/koreanraichu/sequence.gb", "genbank")
count = SeqIO.write(records, "/home/koreanraichu/my_example.fasta", "fasta")
print("Converted %i records" % count)

count2 = SeqIO.convert("/home/koreanraichu/sequence.gb", "genbank", "/home/koreanraichu/my_example2.fasta", "fasta")
print("Converted %i records" % count2)

긁어오는 방법은 두 가지고 둘 다 된다. 

from Bio import SeqIO
records = [rec.reverse_complement(id="rc_"+rec.id, description = "reverse complement")
           for rec in SeqIO.parse("/home/koreanraichu/at5g40780.gb", "genbank")]
SeqIO.write(records, "/home/koreanraichu/at5g40780_rc.fasta", "fasta")

DNA의 경우 reverse complement sequence로 저장하는 것도 된다. (단백질은 하지말자...)

Lv. 35 라이츄

Lv. 35 라이츄

광고 매크로 없는 청정한 블로그를 위해 노력중입니다. 근데 나만 노력하는 것 같음… ㅡㅡ

Biopython으로 시퀀스 레코드 생성하고 만져보기

Coding/Python 2022. 8. 20. 02:30

오늘껀 근데 하면서 나도 이해 못했음... 


SeqRecord를 만들려면 뭘 또 모셔와야 하는데 바로 

from Bio.SeqRecord import SeqRecord

이놈이다. 

이정도면 복잡한 거 돌리려면 아주 모셔오는데만 열댓줄 쓰게 생겼는데??? 텐서플로우나 저기 그 넘파이 판다스처럼 합쳐버리지 그러냐... (걔네는 일단 한 번 데려오면 다 써먹을 수 있음) 


레코드 생성하기

이모지 넣고싶은데 이거 리눅스라 이모지 넣는 법을 모름... 아무튼 레코드는 이렇게 생성하면 된다. 

from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
EcoRI=Seq("GAATTC") # ECoRI sequence
ECoRI_r=SeqRecord(EcoRI) #로 레코드를 만들어보겠습니다 
print(EcoRI)
print(ECoRI_r)

그러면 어떻게 나오느냐 

GAATTC # 시퀀스
ID: <unknown id> # 여기서부터 레코드
Name: <unknown name>
Description: <unknown description>
Number of features: 0
Seq('GAATTC')

이렇게 나온다. 뭐가 많이 빠졌지만 일단 레코드 맞다. 여기다 넣고싶으면 일단 다음 문단으로 가시고... 아니 그럼 생성할 때 넣을 수 있나요? 예. 됩니다. 

HaeIII_r=SeqRecord(HaeIII,id="RES_HAEIII",name="HaeIII restriction site",description="Restriction of HaeIII, cuts GG/CC")

노션도 되는 wrap이 안되는 코드블록... 네이버 니네 점검하면 대체 뭐하냐... 아무튼 이렇게 만들 때 넣으면 

ID: RES_HAEIII
Name: HaeIII restriction site
Description: Restriction of HaeIII, cuts GG/CC
Number of features: 0
Seq('GGCC')

이렇게 생성된다. 

 

레코드에 뭔가 넣기

아니 아까 만든거에 뭐 좀 어떻게 넣어주면 안됩니까? 아뇨 됩니다. 

# 레코드에 정보를 추가해보자
ECoRI_r.id="RES_ECORI"
ECoRI_r.name="ECoRI restriction site"
ECoRI_r.description="Restriction site of ECoRI, cuts G/AATTC"
# ID랑 이름, description 정보를 추가했다.

이렇게 필요한 정보를 넣어줄 수 있다. (참고: EcoRI이 맞는 표기다)

ID: RES_ECORI
Name: ECoRI restriction site
Description: Restriction site of ECoRI, cuts G/AATTC
Number of features: 0
Seq('GAATTC')

제가 된다고 했잖아여... 

HaeIII_r.annotations["notes"]="갑자기 해삼이 먹고싶다. " # annotation
HaeIII_r.letter_annotations['letter annotations']=['G','G/','C','C'] # letter annotations
/notes=갑자기 해삼이 먹고싶다. 
Per letter annotation for: letter annotations

주석도 넣을 수 있다. 참고로 HaeIII은 Haemophilus influenzae biogroup aegyptius에서 유래한 제한효소. 본인은 해삼 혹은 해쓰리로 읽는다. (자매품 XbaI...)

 

FASTA에서 긁어오기

FASTA와 Genbank 파일에서 레코드를 긁어올 수도 있다. 후자쪽이 정보는 많음. 

from Bio import SeqIO
record=SeqIO.read('/home/koreanraichu/rcsb_pdb_7PNN.fasta','fasta')
print(record)
record=SeqIO.parse('/home/koreanraichu/rcsb_pdb_2B10.fasta','fasta')

FASTA 파일에 하나 들어있으면 read, 두 개 이상이면 parse로 긁어야 한다. (두 개 이상인데 read로 긁으면 ValueError: More than one record found in handle가 당신을 반길 것이다)

 

print(record.id)
7PNN_1|Chain
for record2 in SeqIO.parse("/home/koreanraichu/rcsb_pdb_2B10.fasta","fasta"):
    print(record2.id)
2B10_1|Chains
2B10_2|Chains

긁는 난이도치고 접근은 되게 쉽다. 근데 parse로 긁어오면 for문 써야 나온다... 

 

Genbank에서 긁어오기

record=SeqIO.read("/home/koreanraichu/sequence.gb",'genbank')
print(record)
print(record.description)
Homo sapiens hemoglobin subunit alpha 2 (HBA2), mRNA

얘는 리드로 파스고 없이 걍 긁어오면 접근이 바로 된다. 편한데? 

 

Feature

시퀀스 레코드의 정보를 담고 있다고 하는데 뭔지 모르겠는 건 기분탓이 아니다. 여기서 볼 것은 .type와 .location, .qualifiers이다. 

 

.type

for feature in record.features:
    print(feature.type)
source
gene
CDS
sig_peptide
mat_peptide
mat_peptide
mat_peptide
regulatory
polyA_site

얘도 for문 돌려야 나온다. 타입은 말 그대로 타입. 

.location

for feature in record.features:
    print(feature.location)
[0:432](+)
[0:432](+)
[41:371](+)
[41:113](+)
[113:200](+)
[206:293](+)
[299:368](+)
[413:419](+)
[431:432](+)

Feature의 위치 정보. +는 뭐임? 

 

.qualifiers

for feature in record.features:
    print(feature.qualifiers)
OrderedDict([('organism', ['Octodon degus']), ('mol_type', ['mRNA']), ('db_xref', ['taxon:10160']), ('tissue_type', ['pancreas'])])
OrderedDict([('gene', ['insulin'])])
OrderedDict([('gene', ['insulin']), ('codon_start', ['1']), ('product', ['insulin']), ('protein_id', ['AAA40590.1']), ('translation', ['MAPWMHLLTVLALLALWGPNSVQAYSSQHLCGSNLVEALYMTCGRSGFYRPHDRRELEDLQVEQAELGLEAGGLQPSALEMILQKRGIVDQCCNNICTFNQLQNYCNVP'])])
OrderedDict([('gene', ['insulin'])])
OrderedDict([('gene', ['insulin']), ('product', ['insulin B-chain'])])
OrderedDict([('gene', ['insulin']), ('product', ['insulin C-peptide'])])
OrderedDict([('gene', ['insulin']), ('product', ['insulin A-chain'])])
OrderedDict([('regulatory_class', ['polyA_signal_sequence']), ('gene', ['insulin'])])
OrderedDict([('gene', ['insulin'])])

Feature를 딕셔너리 형태로 저장해둔 것. Octodon degus는 데덴네의 모티브가 되기도 한 데구의 학명이다. 

 

Position과 location

Feature의 위치 정보. Position은 0차원(점)이고 Location은 1차원(선)이다.

location은 포지션 두 개가 있어야된다. 여부터 여까지라는 얘기니께...

from Bio import SeqFeature
start_pos=SeqFeature.ExactPosition(15)
end_pos=SeqFeature.ExactPosition(30)
location=SeqFeature.FeatureLocation(start_pos, end_pos)
print(start_pos,end_pos,location)
15 30 [15:30]
start_pos2 = SeqFeature.AfterPosition(1)
end_pos2 = SeqFeature.BetweenPosition(9, left=8, right=9)
my_location = SeqFeature.FeatureLocation(start_pos2, end_pos2)
print(start_pos2,end_pos2,my_location)
>1 (8^9) [>1:(8^9)]

30쓰니까 에러뜸... 

 

포지션이 명확할 때도 있고 명확하지 않을 때도 있다. 무슨 양자역학이냐... 슈뢰딩거의 포지션

 

start_pos2 = SeqFeature.AfterPosition(1)
end_pos2 = SeqFeature.BeforePosition(8)
my_location = SeqFeature.FeatureLocation(start_pos2, end_pos2)
print(start_pos2,end_pos2,my_location)
>1 <8 [>1:<8]

참고로 애프터가 있다는 것은 비포가 있다는 얘기이기도 하다. (대충 펀쿨섹좌 짤)

 

실제 파일에 적용해보기

ID: NM_000517.6
Name: NM_000517
Description: Homo sapiens hemoglobin subunit alpha 2 (HBA2), mRNA
Number of features: 44
/molecule_type=mRNA
/topology=linear
/data_file_division=PRI
/date=06-SEP-2021
/accessions=['NM_000517']
/sequence_version=6
/keywords=['RefSeq', 'MANE Select']
/source=Homo sapiens (human)
/organism=Homo sapiens
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']
/references=[Reference(title='Molecular and Hematological Analysis of Alpha- and Beta-Thalassemia in a Cohort of Mexican Patients', ...), Reference(title='Molecular Characterization and Hematological Aspects of Hb E-Myanmar [beta26(B8)Glu-->Lys and beta65(E9)Lys-->Asn, HBB: c.[79G>A;198G>C]): A Novel beta-Thalassemic Hemoglobin Variant', ...), Reference(title='delta-Globin Chain Variants Associated with Decreased Hb A2 Levels: A National Reference Laboratory Experience', ...), Reference(title='Heterozygosity for the Novel HBA2: c.*91_*92delTA Polyadenylation Site Variant on the alpha2-Globin Gene Expanding the Genetic Spectrum of alpha-Thalassemia in Iran', ...), Reference(title='Identification and characterization of two novel and differentially expressed isoforms of human alpha2- and alpha1-globin genes', ...), Reference(title='Alpha-Thalassemia', ...), Reference(title="Cloning and complete nucleotide sequence of human 5'-alpha-globin gene", ...), Reference(title='A new abnormal human hemoglobin: Hb Prato (alpha 2 31 (B12) Arg leads to Ser beta 2)', ...), Reference(title='Hemoglobin Suan-Dok (alpha 2 109 (G16) Leu replaced by Arg beta 2): an unstable variant associated with alpha-thalassemia', ...), Reference(title='Hemoglobin Tarrant: alpha126(H9) Asp leads to Asn. A new hemoglobin variant in the alpha1beta1 contact region showing high oxygen affinity and reduced cooperativity', ...)]
/comment=REVIEWED REFSEQ: This record has been curated by NCBI staff. The
reference sequence was derived from V00493.1, Z84721.1 and
AI016696.1.
This sequence is a reference standard in the RefSeqGene project.
On Aug 3, 2018 this sequence version replaced NM_000517.5.
Summary: The human alpha globin gene cluster located on chromosome
16 spans about 30 kb and includes seven loci: 5'- zeta - pseudozeta
- mu - pseudoalpha-1 - alpha-2 - alpha-1 - theta - 3'. The alpha-2
(HBA2) and alpha-1 (HBA1) coding sequences are identical. These
genes differ slightly over the 5' untranslated regions and the
introns, but they differ significantly over the 3' untranslated
regions. Two alpha chains plus two beta chains constitute HbA,
which in normal adult life comprises about 97% of the total
hemoglobin; alpha chains combine with delta chains to constitute
HbA-2, which with HbF (fetal hemoglobin) makes up the remaining 3%
of adult hemoglobin. Alpha thalassemias result from deletions of
each of the alpha genes as well as deletions of both HBA2 and HBA1;
some nondeletion alpha thalassemias have also been reported.
[provided by RefSeq, Jul 2008].
Publication Note:  This RefSeq record includes a subset of the
publications that are available for this gene. Please see the Gene
record to access additional publications.
COMPLETENESS: full length.
/structured_comment=OrderedDict([('Evidence-Data', OrderedDict([('Transcript exon combination', 'AI816040.1, AI815806.1 [ECO:0000332]'), ('RNAseq introns', 'single sample supports all introns SAMEA1965299, SAMEA1966682 [ECO:0000348]')])), ('RefSeq-Attributes', OrderedDict([('MANE Ensembl match', 'ENST00000251595.11/ ENSP00000251595.6'), ('RefSeq Select criteria', 'based on single protein-coding transcript')]))])
Seq('ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGC...GCA')

오늘의 시범조교. 뭐야 wrap 돌려줘요.

 

record=SeqIO.read("/home/koreanraichu/sequence.gb",'genbank') # 여기서 뭘 뽑아봅시다
index=575
for feature in record.features:
    if index in feature:
        print("%s, %s, %s" %(feature.type,feature.location,feature.qualifiers.get("db_xref")))

여기서 575가 들어가는 feature를 찾아보자. 

 

source, [0:576](+), ['taxon:9606']
gene, [0:576](+), ['GeneID:3040', 'HGNC:HGNC:4824', 'MIM:141850']
exon, [337:576](+), None

아 그래서 이렇게 나온건가... (아직도 이해 못함) 

 

SeqFeature로 시퀀스의 특정 부분만 잘라서 보기

from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation
at5g40780=Seq("MVAQAPHDDHQDDEKLAAARQKEIEDWLPITSSRNAKWWYSAFHNVTAMVGAGVLGLPYAMSQLGWGPGIAVLVLSWVITLYTLWQMVEMHEMVPGKRFDRYHELGQHAFGEKLGLYIVVPQQLIVEIGVCIVYMVTGGKSLKKFHELVCDDCKPIKLTYFIMIFASVHFVLSHLPNFNSISGVSLAAAVMSLSYSTIAWASSASKGVQEDVQYGYKAKTTAGTVFNFFSGLGDVAFAYAGHNVVLEIQATIPSTPEKPSKGPMWRGVIVAYIVVALCYFPVALVGYYIFGNGVEDNILMSLKKPAWLIATANIFVVIHVIGSYQIYAMPVFDMMETLLVKKLNFRPTTTLRFFVRNFYVAATMFVGMTFPFFGGLLAFFGGFAFAPTTYFLPCVIWLAIYKPKKYSLSWWANWVCIVFGLFLMVLSPIGGLRTIVIQAKGYKFYS")
# 단백질 시퀀스 왜 이렇게 길어 이거
feature = SeqFeature(FeatureLocation(0, 30), type="protein", strand=1)
feature_seq = at5g40780[feature.location.start:feature.location.end]
print(feature_seq)

TAIR에서 돌아온 시범조교, LHT1의 단백질 시퀀스다. 뭐요. 파이참 wrap 일일이 켜기도 귀찮은데 걍 혀... 아무튼 feature라는 변수를 통해 30번째까지 볼 거라는 얘기다. strand가 -1이면 상보적인 시퀀스를 추출한다는데 저거 단백질이라 의미 없음. 

 

feature_seq2 = feature.extract(at5g40780)
print(feature_seq2)

Feature 변수 주고 extract해도 된다. 

 

레코드 비교하기

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
record1=SeqRecord(Seq("GAATTC"), id="ECoR1",description="sticky")
record2=SeqRecord(Seq("GGCC"), id="HaeIII",description="blunt")
record3=SeqRecord(Seq("CTCGAG"), id="XhoI",description="sticky")
record4=SeqRecord(Seq("GATC"), id="DpnII",description="sticky")

이번 시범조교는 얘네들이다. 

레코드 비교? 그럼 다이렉트로 비교 되나요? 그러면 NotImplementedError: SeqRecord comparison is deliberately not implemented. Explicitly compare the attributes of interest. 에러가 당신을 반겨줄 것이다. 레코드 비교는 안에 든 걸 비교하는거지 레코드를 쌩으로 비교하라는 얘기가 아니다. 

 

print(record1.description==record3.description)
True

EcoRI과 XhoI의 description이 같으므로 참이다.

print(len(record1.seq)==len(record2.seq))
False

EcoRI과 HaeIII의 시퀀스 길이는 다르므로 false. 

 

포맷 만들기

아니 컴퓨터 포맷 아님. 

 

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
record=SeqRecord(
    Seq(
        "ATGGTAGCTCAAGCTCCTCATGATGATCATCAGGATGATGAGAAATTAGCAGCAGCGAGACAAAAAGAGATCGAAGATTGGTTACCAATTACTTCATCAAGAAATGCAAAGTGGTGGTACTCTGCTTTTCACAATGTCACCGCCATGGTCGGTGCCGGAGTTCTTGGACTCCCTTACGCCATGTCTCAGCTCGGATGGGGACCGGGAATTGCAGTGTTGGTTTTGTCATGGGTCATAACACTATACACATTATGGCAAATGGTGGAAATGCATGAAATGGTTCCTGGAAAGCGTTTTGATCGTTACCATGAGCTCGGACAACACGCGTTTGGAGAAAAACTCGGTCTTTATATCGTTGTGCCGCAACAATTGATCGTTGAAATCGGTGTTTGCATCGTTTATATGGTCACTGGAGGCAAATCTTTAAAGAAATTTCATGAGCTTGTTTGTGATGATTGTAAACCAATCAAGCTTACTTATTTCATCATGATCTTTGCTTCGGTTCACTTCGTCCTCTCTCATCTTCCTAATTTCAATTCCATCTCCGGCGTTTCTCTTGCTGCTGCCGTTATGTCTCTCAGCTACTCAACAATCGCATGGGCATCATCAGCAAGCAAAGGTGTTCAAGAAGACGTTCAATACGGTTACAAAGCGAAAACAACAGCCGGTACGGTTTTCAATTTCTTCAGCGGTTTAGGTGATGTGGCATTTGCTTACGCGGGTCATAATGTGGTCCTTGAGATCCAAGCAACTATCCCTTCAACGCCTGAGAAACCCTCAAAAGGTCCCATGTGGAGAGGAGTCATCGTTGCTTACATCGTCGTAGCGCTCTGTTATTTCCCCGTGGCTCTCGTTGGATACTACATTTTCGGGAACGGAGTCGAAGATAATATTCTCATGTCACTTAAGAAACCGGCGTGGTTAATCGCCACGGCGAACATCTTCGTCGTGATCCATGTCATTGGTAGTTACCAGATATATGCAATGCCGGTATTTGATATGATGGAGACTTTATTGGTCAAGAAGCTAAATTTCAGACCAACCACAACTCTTCGGTTCTTTGTCCGTAATTTCTACGTTGCTGCAACTATGTTTGTTGGTATGACGTTTCCGTTCTTCGGTGGGCTTTTGGCGTTCTTTGGTGGATTCGCGTTTGCCCCAACCACATACTTCCTCCCTTGCGTCATTTGGTTAGCCATCTACAAACCCAAGAAATACAGCTTGTCTTGGTGGGCCAACTGGGTATGCATCGTGTTTGGTCTTTTCTTGATGGTCCTATCGCCAATTGGAGGGCTAAGGACAATCGTTATTCAAGCAAAAGGATACAAGTTTTACTCATAA"),
id="at5g40780|LHT1",
description="Lysine-histidine transporter 1|Arabidopsis thaliana")
print(record.format("fasta"))

이런 식으로 입력하면 

 

>at5g40780|LHT1 Lysine-histidine transporter 1|Arabidopsis thaliana
ATGGTAGCTCAAGCTCCTCATGATGATCATCAGGATGATGAGAAATTAGCAGCAGCGAGA
CAAAAAGAGATCGAAGATTGGTTACCAATTACTTCATCAAGAAATGCAAAGTGGTGGTAC
TCTGCTTTTCACAATGTCACCGCCATGGTCGGTGCCGGAGTTCTTGGACTCCCTTACGCC
ATGTCTCAGCTCGGATGGGGACCGGGAATTGCAGTGTTGGTTTTGTCATGGGTCATAACA
CTATACACATTATGGCAAATGGTGGAAATGCATGAAATGGTTCCTGGAAAGCGTTTTGAT
CGTTACCATGAGCTCGGACAACACGCGTTTGGAGAAAAACTCGGTCTTTATATCGTTGTG
CCGCAACAATTGATCGTTGAAATCGGTGTTTGCATCGTTTATATGGTCACTGGAGGCAAA
TCTTTAAAGAAATTTCATGAGCTTGTTTGTGATGATTGTAAACCAATCAAGCTTACTTAT
TTCATCATGATCTTTGCTTCGGTTCACTTCGTCCTCTCTCATCTTCCTAATTTCAATTCC
ATCTCCGGCGTTTCTCTTGCTGCTGCCGTTATGTCTCTCAGCTACTCAACAATCGCATGG
GCATCATCAGCAAGCAAAGGTGTTCAAGAAGACGTTCAATACGGTTACAAAGCGAAAACA
ACAGCCGGTACGGTTTTCAATTTCTTCAGCGGTTTAGGTGATGTGGCATTTGCTTACGCG
GGTCATAATGTGGTCCTTGAGATCCAAGCAACTATCCCTTCAACGCCTGAGAAACCCTCA
AAAGGTCCCATGTGGAGAGGAGTCATCGTTGCTTACATCGTCGTAGCGCTCTGTTATTTC
CCCGTGGCTCTCGTTGGATACTACATTTTCGGGAACGGAGTCGAAGATAATATTCTCATG
TCACTTAAGAAACCGGCGTGGTTAATCGCCACGGCGAACATCTTCGTCGTGATCCATGTC
ATTGGTAGTTACCAGATATATGCAATGCCGGTATTTGATATGATGGAGACTTTATTGGTC
AAGAAGCTAAATTTCAGACCAACCACAACTCTTCGGTTCTTTGTCCGTAATTTCTACGTT
GCTGCAACTATGTTTGTTGGTATGACGTTTCCGTTCTTCGGTGGGCTTTTGGCGTTCTTT
GGTGGATTCGCGTTTGCCCCAACCACATACTTCCTCCCTTGCGTCATTTGGTTAGCCATC
TACAAACCCAAGAAATACAGCTTGTCTTGGTGGGCCAACTGGGTATGCATCGTGTTTGGT
CTTTTCTTGATGGTCCTATCGCCAATTGGAGGGCTAAGGACAATCGTTATTCAAGCAAAA
GGATACAAGTTTTACTCATAA

이렇게 FASTA 포맷이 된다. 

 

슬라이싱

feature도 slicing이 된다. 어째서인지는 모르겠으나... 

 

from Bio import SeqIO
record = SeqIO.read("/home/koreanraichu/sequence.gb", "genbank")
print(record.features[20])

이건 근데 인덱싱 아니냐. 

 

sub_record = record[0:5]
print(sub_record)
ID: NM_000517.6
Name: NM_000517
Description: Homo sapiens hemoglobin subunit alpha 2 (HBA2), mRNA
Number of features: 0
/molecule_type=mRNA
Seq('ACTCT')

아무튼 레코드에서 0부터 5까지 뽑아보았다. 염기 6개가 나온다. 

 

record = SeqIO.read("/home/koreanraichu/sequence.gb", "genbank")
sub_record = record[0:576]
print(sub_record.features[0])

범위를 풀로 바꾸고 0번째 feature를 봤다. 

 

type: source
location: [0:576](+)
qualifiers:
    Key: chromosome, Value: ['16']
    Key: db_xref, Value: ['taxon:9606']
    Key: map, Value: ['16p13.3']
    Key: mol_type, Value: ['mRNA']
    Key: organism, Value: ['Homo sapiens']

그러면 이렇게 나옵니다.

 

print(sub_record.format("genbank"))
print(sub_record.format("fasta"))

위 코드를 이용해 출력 포맷을 바꿀 수도 있다. 

>NM_000517.6 Homo sapiens hemoglobin subunit alpha 2 (HBA2), mRNA
ACTCTTCTGGTCCCCACAGA
# FASTA
LOCUS       NM_000517                 20 bp    mRNA             UNK 01-JAN-1980
DEFINITION  Homo sapiens hemoglobin subunit alpha 2 (HBA2), mRNA.
ACCESSION   NM_000517
VERSION     NM_000517.6
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
ORIGIN
        1 actcttctgg tccccacaga
//
# Genbank

위는 FASTA, 아래는 Genbank 포맷. 

 

'Coding > Python' 카테고리의 다른 글

번외편-코딩테스트 풀이  (0) 2022.08.20
Biopython으로 MSA 해보기  (0) 2022.08.20
Biopython SeqIO 써보기  (0) 2022.08.20
Biopython으로 시퀀스 다뤄보기  (0) 2022.08.20
Biopython으로 시퀀스 가져오기  (0) 2022.08.20
Lv. 35 라이츄

Lv. 35 라이츄

광고 매크로 없는 청정한 블로그를 위해 노력중입니다. 근데 나만 노력하는 것 같음… ㅡㅡ

Biopython으로 시퀀스 다뤄보기

Coding/Python 2022. 8. 20. 02:14

전에요? 아니 그건 가져온거고. 


시퀀스도 텍스트나 리스트처럼 인덱싱과 슬라이싱이 된다. 방법도 똑같다. 그래서 그건 생략할 예정임... 그리고 그거 아니어도 분량이 많아요 또. 

 

.count()

특정 문자열이 몇 개인지 세 준다. 뭐 시퀀스에서 아데닌이나 구아닌 갯수 세주고 그런거다. 이걸 이용하면 특정 단백질 시퀀스에서 특정 아미노산(카테고리면 겁나 세야된다...)이 차지하는 비율도 볼 수 있다. 

 

쉬어가는 코너-Primer에서 GC함량 구하기

아 이거 중요합니다. Primer 만들 때 중요한 요소 중 하나가 GC 함량이다. 참고로 여기서는 두 가지 방법으로 구해볼건데 첫번째는 .count()를 이용해 G와 C의 수를 세서 전체 DNA 염기 수로 나누는 거고, 두번째는 바이오파이썬의 모듈을 이용하는 방법이다. 

GC_cont=(primer.count('C')+primer.count('G'))/len(primer)
print(GC_cont*100,"%")
from Bio.SeqUtils import GC
primer=Seq('CAGCAAGCAAAGGTGTTCAA')
print(GC(primer),"%")

위는 직접 계산하는 방법이고 아래는 모듈을 사용했다. 모듈은 쓰려면 또 모셔와야 한다. 

 

시퀀스에 시퀀스를 더하면 1+1인가요 

시퀀스 그냥 +로 갖다 붙이거나 리스트업된 거 for문으로 붙이거나 join()쓰거나... 일단 for와 join 보고 갑시다. 

seq_list=[Seq('ATCC'),Seq('ATAT'),Seq('TNNN')]
add = Seq("")
for i in seq_list:
    add += i
print(add)
seq_list=[Seq('ATCC'),Seq('ATAT'),Seq('TNNN')]
spacer=Seq('N'*5)
print(spacer.join(seq_list))

 

돌연변이 만들기

시퀀스 데이터는 튜플마냥 안에 있는 내용물을 수정할 수 없다. 수정하려면 또 다른 모듈을 모셔와야 한다. 

from Bio.Seq import Seq
from Bio.Seq import MutableSeq #이걸 불러와서 
my_seq=Seq("GAATTC")
mutable_seq=MutableSeq(my_seq) #적용해줘야 한다
mutable_seq[0]="A"
print(mutable_seq)

Mutableseq에 던져둔 것은 타입도 

<class 'Bio.Seq.MutableSeq'>

로 바뀐다. 일반 시퀀스는 Seq. 

 

전사번역

입력한 시퀀스를 mRNA로 전사도 해 주고, 번역도 된다. 물론 다이렉트로 된다. 

 

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqUtils import GC
at5g40780=Seq('ATGGTAGCTCAAGCTCCTCATGATGATCATCAGGATGATGAGAAATTAGCAGCAGCGAGACAAAAAGAGATCGAAGATTGGTTACCAATTACTTCATCAAGAAATGCAAAGTGGTGGTACTCTGCTTTTCACAATGTCACCGCCATGGTCGGTGCCGGAGTTCTTGGACTCCCTTACGCCATGTCTCAGCTCGGATGGGGACCGGGAATTGCAGTGTTGGTTTTGTCATGGGTCATAACACTATACACATTATGGCAAATGGTGGAAATGCATGAAATGGTTCCTGGAAAGCGTTTTGATCGTTACCATGAGCTCGGACAACACGCGTTTGGAGAAAAACTCGGTCTTTATATCGTTGTGCCGCAACAATTGATCGTTGAAATCGGTGTTTGCATCGTTTATATGGTCACTGGAGGCAAATCTTTAAAGAAATTTCATGAGCTTGTTTGTGATGATTGTAAACCAATCAAGCTTACTTATTTCATCATGATCTTTGCTTCGGTTCACTTCGTCCTCTCTCATCTTCCTAATTTCAATTCCATCTCCGGCGTTTCTCTTGCTGCTGCCGTTATGTCTCTCAGCTACTCAACAATCGCATGGGCATCATCAGCAAGCAAAGGTGTTCAAGAAGACGTTCAATACGGTTACAAAGCGAAAACAACAGCCGGTACGGTTTTCAATTTCTTCAGCGGTTTAGGTGATGTGGCATTTGCTTACGCGGGTCATAATGTGGTCCTTGAGATCCAAGCAACTATCCCTTCAACGCCTGAGAAACCCTCAAAAGGTCCCATGTGGAGAGGAGTCATCGTTGCTTACATCGTCGTAGCGCTCTGTTATTTCCCCGTGGCTCTCGTTGGATACTACATTTTCGGGAACGGAGTCGAAGATAATATTCTCATGTCACTTAAGAAACCGGCGTGGTTAATCGCCACGGCGAACATCTTCGTCGTGATCCATGTCATTGGTAGTTACCAGATATATGCAATGCCGGTATTTGATATGATGGAGACTTTATTGGTCAAGAAGCTAAATTTCAGACCAACCACAACTCTTCGGTTCTTTGTCCGTAATTTCTACGTTGCTGCAACTATGTTTGTTGGTATGACGTTTCCGTTCTTCGGTGGGCTTTTGGCGTTCTTTGGTGGATTCGCGTTTGCCCCAACCACATACTTCCTCCCTTGCGTCATTTGGTTAGCCATCTACAAACCCAAGAAATACAGCTTGTCTTGGTGGGCCAACTGGGTATGCATCGTGTTTGGTCTTTTCTTGATGGTCCTATCGCCAATTGGAGGGCTAAGGACAATCGTTATTCAAGCAAAAGGATACAAGTTTTACTCATAA')
at5g40780_rc=at5g40780.reverse_complement()
# DNA와 상보적인 sequence를 만들었다.

at5g40780_mrna=at5g40780.transcribe()
at5g40780_mrna_rc=at5g40780_mrna.reverse_complement()
# 여기 RNA 추가요!
print(at5g40780_mrna)
AUGGUAGCUCAAGCUCCUCAUGAUGAUCAUCAGGAUGAUGAGAAAUUAGCAGCAGCGAGACAAAAAGAGAUCGAAGAUUGGUUACCAAUUACUUCAUCAAGAAAUGCAAAGUGGUGGUACUCUGCUUUUCACAAUGUCACCGCCAUGGUCGGUGCCGGAGUUCUUGGACUCCCUUACGCCAUGUCUCAGCUCGGAUGGGGACCGGGAAUUGCAGUGUUGGUUUUGUCAUGGGUCAUAACACUAUACACAUUAUGGCAAAUGGUGGAAAUGCAUGAAAUGGUUCCUGGAAAGCGUUUUGAUCGUUACCAUGAGCUCGGACAACACGCGUUUGGAGAAAAACUCGGUCUUUAUAUCGUUGUGCCGCAACAAUUGAUCGUUGAAAUCGGUGUUUGCAUCGUUUAUAUGGUCACUGGAGGCAAAUCUUUAAAGAAAUUUCAUGAGCUUGUUUGUGAUGAUUGUAAACCAAUCAAGCUUACUUAUUUCAUCAUGAUCUUUGCUUCGGUUCACUUCGUCCUCUCUCAUCUUCCUAAUUUCAAUUCCAUCUCCGGCGUUUCUCUUGCUGCUGCCGUUAUGUCUCUCAGCUACUCAACAAUCGCAUGGGCAUCAUCAGCAAGCAAAGGUGUUCAAGAAGACGUUCAAUACGGUUACAAAGCGAAAACAACAGCCGGUACGGUUUUCAAUUUCUUCAGCGGUUUAGGUGAUGUGGCAUUUGCUUACGCGGGUCAUAAUGUGGUCCUUGAGAUCCAAGCAACUAUCCCUUCAACGCCUGAGAAACCCUCAAAAGGUCCCAUGUGGAGAGGAGUCAUCGUUGCUUACAUCGUCGUAGCGCUCUGUUAUUUCCCCGUGGCUCUCGUUGGAUACUACAUUUUCGGGAACGGAGUCGAAGAUAAUAUUCUCAUGUCACUUAAGAAACCGGCGUGGUUAAUCGCCACGGCGAACAUCUUCGUCGUGAUCCAUGUCAUUGGUAGUUACCAGAUAUAUGCAAUGCCGGUAUUUGAUAUGAUGGAGACUUUAUUGGUCAAGAAGCUAAAUUUCAGACCAACCACAACUCUUCGGUUCUUUGUCCGUAAUUUCUACGUUGCUGCAACUAUGUUUGUUGGUAUGACGUUUCCGUUCUUCGGUGGGCUUUUGGCGUUCUUUGGUGGAUUCGCGUUUGCCCCAACCACAUACUUCCUCCCUUGCGUCAUUUGGUUAGCCAUCUACAAACCCAAGAAAUACAGCUUGUCUUGGUGGGCCAACUGGGUAUGCAUCGUGUUUGGUCUUUUCUUGAUGGUCCUAUCGCCAAUUGGAGGGCUAAGGACAAUCGUUAUUCAAGCAAAAGGAUACAAGUUUUACUCAUAA

.transcribe()를 이용하면 전사도 해 준다. 예시에 있는 시퀀스는 at5g40780(의 CDS). 

 

from Bio.Seq import Seq
at5g40780=Seq('ATGGTAGCTCAAGCTCCTCATGATGATCATCAGGATGATGAGAAATTAGCAGCAGCGAGACAAAAAGAGATCGAAGATTGGTTACCAATTACTTCATCAAGAAATGCAAAGTGGTGGTACTCTGCTTTTCACAATGTCACCGCCATGGTCGGTGCCGGAGTTCTTGGACTCCCTTACGCCATGTCTCAGCTCGGATGGGGACCGGGAATTGCAGTGTTGGTTTTGTCATGGGTCATAACACTATACACATTATGGCAAATGGTGGAAATGCATGAAATGGTTCCTGGAAAGCGTTTTGATCGTTACCATGAGCTCGGACAACACGCGTTTGGAGAAAAACTCGGTCTTTATATCGTTGTGCCGCAACAATTGATCGTTGAAATCGGTGTTTGCATCGTTTATATGGTCACTGGAGGCAAATCTTTAAAGAAATTTCATGAGCTTGTTTGTGATGATTGTAAACCAATCAAGCTTACTTATTTCATCATGATCTTTGCTTCGGTTCACTTCGTCCTCTCTCATCTTCCTAATTTCAATTCCATCTCCGGCGTTTCTCTTGCTGCTGCCGTTATGTCTCTCAGCTACTCAACAATCGCATGGGCATCATCAGCAAGCAAAGGTGTTCAAGAAGACGTTCAATACGGTTACAAAGCGAAAACAACAGCCGGTACGGTTTTCAATTTCTTCAGCGGTTTAGGTGATGTGGCATTTGCTTACGCGGGTCATAATGTGGTCCTTGAGATCCAAGCAACTATCCCTTCAACGCCTGAGAAACCCTCAAAAGGTCCCATGTGGAGAGGAGTCATCGTTGCTTACATCGTCGTAGCGCTCTGTTATTTCCCCGTGGCTCTCGTTGGATACTACATTTTCGGGAACGGAGTCGAAGATAATATTCTCATGTCACTTAAGAAACCGGCGTGGTTAATCGCCACGGCGAACATCTTCGTCGTGATCCATGTCATTGGTAGTTACCAGATATATGCAATGCCGGTATTTGATATGATGGAGACTTTATTGGTCAAGAAGCTAAATTTCAGACCAACCACAACTCTTCGGTTCTTTGTCCGTAATTTCTACGTTGCTGCAACTATGTTTGTTGGTATGACGTTTCCGTTCTTCGGTGGGCTTTTGGCGTTCTTTGGTGGATTCGCGTTTGCCCCAACCACATACTTCCTCCCTTGCGTCATTTGGTTAGCCATCTACAAACCCAAGAAATACAGCTTGTCTTGGTGGGCCAACTGGGTATGCATCGTGTTTGGTCTTTTCTTGATGGTCCTATCGCCAATTGGAGGGCTAAGGACAATCGTTATTCAAGCAAAAGGATACAAGTTTTACTCATAA')
# DNA sequence

at5g40780_mrna=at5g40780.transcribe()
# Transcription
print(at5g40780_mrna)
AUGGUAGCUCAAGCUCCUCAUGAUGAUCAUCAGGAUGAUGAGAAAUUAGCAGCAGCGAGACAAAAAGAGAUCGAAGAUUGGUUACCAAUUACUUCAUCAAGAAAUGCAAAGUGGUGGUACUCUGCUUUUCACAAUGUCACCGCCAUGGUCGGUGCCGGAGUUCUUGGACUCCCUUACGCCAUGUCUCAGCUCGGAUGGGGACCGGGAAUUGCAGUGUUGGUUUUGUCAUGGGUCAUAACACUAUACACAUUAUGGCAAAUGGUGGAAAUGCAUGAAAUGGUUCCUGGAAAGCGUUUUGAUCGUUACCAUGAGCUCGGACAACACGCGUUUGGAGAAAAACUCGGUCUUUAUAUCGUUGUGCCGCAACAAUUGAUCGUUGAAAUCGGUGUUUGCAUCGUUUAUAUGGUCACUGGAGGCAAAUCUUUAAAGAAAUUUCAUGAGCUUGUUUGUGAUGAUUGUAAACCAAUCAAGCUUACUUAUUUCAUCAUGAUCUUUGCUUCGGUUCACUUCGUCCUCUCUCAUCUUCCUAAUUUCAAUUCCAUCUCCGGCGUUUCUCUUGCUGCUGCCGUUAUGUCUCUCAGCUACUCAACAAUCGCAUGGGCAUCAUCAGCAAGCAAAGGUGUUCAAGAAGACGUUCAAUACGGUUACAAAGCGAAAACAACAGCCGGUACGGUUUUCAAUUUCUUCAGCGGUUUAGGUGAUGUGGCAUUUGCUUACGCGGGUCAUAAUGUGGUCCUUGAGAUCCAAGCAACUAUCCCUUCAACGCCUGAGAAACCCUCAAAAGGUCCCAUGUGGAGAGGAGUCAUCGUUGCUUACAUCGUCGUAGCGCUCUGUUAUUUCCCCGUGGCUCUCGUUGGAUACUACAUUUUCGGGAACGGAGUCGAAGAUAAUAUUCUCAUGUCACUUAAGAAACCGGCGUGGUUAAUCGCCACGGCGAACAUCUUCGUCGUGAUCCAUGUCAUUGGUAGUUACCAGAUAUAUGCAAUGCCGGUAUUUGAUAUGAUGGAGACUUUAUUGGUCAAGAAGCUAAAUUUCAGACCAACCACAACUCUUCGGUUCUUUGUCCGUAAUUUCUACGUUGCUGCAACUAUGUUUGUUGGUAUGACGUUUCCGUUCUUCGGUGGGCUUUUGGCGUUCUUUGGUGGAUUCGCGUUUGCCCCAACCACAUACUUCCUCCCUUGCGUCAUUUGGUUAGCCAUCUACAAACCCAAGAAAUACAGCUUGUCUUGGUGGGCCAACUGGGUAUGCAUCGUGUUUGGUCUUUUCUUGAUGGUCCUAUCGCCAAUUGGAGGGCUAAGGACAAUCGUUAUUCAAGCAAAAGGAUACAAGUUUUACUCAUAA

.translate()는 번역이다. (*은 종결코돈)

 

전사는 그냥 뭐 전사했구나 하고 넘어가도 되는데, 번역은 옵션이 좀 있으니 일단 보고 가자. 

to_stop=True: 처음 만나는 종결코돈에서 멈춘다

from Bio.Seq import Seq
at5g40780_mrna=Seq('AUGGUAGCUCAAGCUCCUCAUGAUGAUCAUCAGGAUGAUGAGAAAUUAGCAGCAGCGAGACAAAAAGAGAUCGAAGAUUGGUUACCAAUUACUUCAUCAAGAAAUGCAAAGUGGUGGUACUCUGCUUUUCACAAUGUCACCGCCAUGGUCGGUGCCGGAGUUCUUGGACUCCCUUACGCCAUGUCUCAGCUCGGAUGGGGACCGGGAAUUGCAGUGUUGGUUUUGUCAUGGGUCAUAACACUAUACACAUUAUGGCAAAUGGUGGAAAUGCAUGAAAUGGUUCCUGGAAAGCGUUUUGAUCGUUACCAUGAGCUCGGACAACACGCGUUUGGAGAAAAACUCGGUCUUUAUAUCGUUGUGCCGCAACAAUUGAUCGUUGAAAUCGGUGUUUGCAUCGUUUAUAUGGUCACUGGAGGCAAAUCUUUAAAGAAAUUUCAUGAGCUUGUUUGUGAUGAUUGUAAACCAAUCAAGCUUACUUAUUUCAUCAUGAUCUUUGCUUCGGUUCACUUCGUCCUCUCUCAUCUUCCUAAUUUCAAUUCCAUCUCCGGCGUUUCUCUUGCUGCUGCCGUUAUGUCUCUCAGCUACUCAACAAUCGCAUGGGCAUCAUCAGCAAGCAAAGGUGUUCAAGAAGACGUUCAAUACGGUUACAAAGCGAAAACAACAGCCGGUACGGUUUUCAAUUUCUUCAGCGGUUUAGGUGAUGUGGCAUUUGCUUACGCGGGUCAUAAUGUGGUCCUUGAGAUCCAAGCAACUAUCCCUUCAACGCCUGAGAAACCCUCAAAAGGUCCCAUGUGGAGAGGAGUCAUCGUUGCUUACAUCGUCGUAGCGCUCUGUUAUUUCCCCGUGGCUCUCGUUGGAUACUACAUUUUCGGGAACGGAGUCGAAGAUAAUAUUCUCAUGUCACUUAAGAAACCGGCGUGGUUAAUCGCCACGGCGAACAUCUUCGUCGUGAUCCAUGUCAUUGGUAGUUACCAGAUAUAUGCAAUGCCGGUAUUUGAUAUGAUGGAGACUUUAUUGGUCAAGAAGCUAAAUUUCAGACCAACCACAACUCUUCGGUUCUUUGUCCGUAAUUUCUACGUUGCUGCAACUAUGUUUGUUGGUAUGACGUUUCCGUUCUUCGGUGGGCUUUUGGCGUUCUUUGGUGGAUUCGCGUUUGCCCCAACCACAUACUUCCUCCCUUGCGUCAUUUGGUUAGCCAUCUACAAACCCAAGAAAUACAGCUUGUCUUGGUGGGCCAACUGGGUAUGCAUCGUGUUUGGUCUUUUCUUGAUGGUCCUAUCGCCAAUUGGAGGGCUAAGGACAAUCGUUAUUCAAGCAAAAGGAUACAAGUUUUACUCAUAA')
# 이건 mRNA 시퀀스고
at5g40780_protein=at5g40780_mrna.translate(to_stop=True)
#번역했다.
print(at5g40780_protein)
MVAQAPHDDHQDDEKLAAARQKEIEDWLPITSSRNAKWWYSAFHNVTAMVGAGVLGLPYAMSQLGWGPGIAVLVLSWVITLYTLWQMVEMHEMVPGKRFDRYHELGQHAFGEKLGLYIVVPQQLIVEIGVCIVYMVTGGKSLKKFHELVCDDCKPIKLTYFIMIFASVHFVLSHLPNFNSISGVSLAAAVMSLSYSTIAWASSASKGVQEDVQYGYKAKTTAGTVFNFFSGLGDVAFAYAGHNVVLEIQATIPSTPEKPSKGPMWRGVIVAYIVVALCYFPVALVGYYIFGNGVEDNILMSLKKPAWLIATANIFVVIHVIGSYQIYAMPVFDMMETLLVKKLNFRPTTTLRFFVRNFYVAATMFVGMTFPFFGGLLAFFGGFAFAPTTYFLPCVIWLAIYKPKKYSLSWWANWVCIVFGLFLMVLSPIGGLRTIVIQAKGYKFYS


table=2: 아래 사이트에서 제공하는 코돈 테이블로 번역할 수 있다. (기본은 1번 스탠다드)
https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

 

https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

The Genetic Codes Compiled by Andrzej (Anjay) Elzanowski and Jim Ostell at National Center for Biotechnology Information (NCBI), Bethesda, Maryland, U.S.A. Last update of the Genetic Codes: Jan. 7, 2019 NCBI takes great care to ensure that the translation

www.ncbi.nlm.nih.gov

 

at5g40780_protein=at5g40780_mrna.translate(table=2)
MVAQAPHDDHQDDEKLAAA*QKEIEDWLPITSS*NAKWWYSAFHNVTAMVGAGVLGLPYAMSQLGWGPGIAVLVLSWVMTLYTLWQMVEMHEMVPGKRFDRYHELGQHAFGEKLGLYIVVPQQLIVEIGVCIVYMVTGGKSLKKFHELVCDDCKPIKLTYFIMIFASVHFVLSHLPNFNSISGVSLAAAVMSLSYSTIAWASSASKGVQEDVQYGYKAKTTAGTVFNFFSGLGDVAFAYAGHNVVLEIQATIPSTPEKPSKGPMW*GVIVAYIVVALCYFPVALVGYYIFGNGVEDNILMSLKKPAWLIATANIFVVIHVIGSYQMYAMPVFDMMETLLVKKLNF*PTTTLRFFVRNFYVAATMFVGMTFPFFGGLLAFFGGFAFAPTTYFLPCVIWLAIYKPKKYSLSWWANWVCIVFGLFLMVLSPIGGL*TIVIQAKGYKFYS*

위 CDS를 table 2(Vertebrate mitochondria)로 설정하고 번역한 결과. *은 종결 코돈이다. 

stop_symbol="@": 종결코돈의 기호를 @로 바꾼다. 

at5g40780_protein=at5g40780_mrna.translate(table=2,stop_symbol="-")
MVAQAPHDDHQDDEKLAAA-QKEIEDWLPITSS-NAKWWYSAFHNVTAMVGAGVLGLPYAMSQLGWGPGIAVLVLSWVMTLYTLWQMVEMHEMVPGKRFDRYHELGQHAFGEKLGLYIVVPQQLIVEIGVCIVYMVTGGKSLKKFHELVCDDCKPIKLTYFIMIFASVHFVLSHLPNFNSISGVSLAAAVMSLSYSTIAWASSASKGVQEDVQYGYKAKTTAGTVFNFFSGLGDVAFAYAGHNVVLEIQATIPSTPEKPSKGPMW-GVIVAYIVVALCYFPVALVGYYIFGNGVEDNILMSLKKPAWLIATANIFVVIHVIGSYQMYAMPVFDMMETLLVKKLNF-PTTTLRFFVRNFYVAATMFVGMTFPFFGGLLAFFGGFAFAPTTYFLPCVIWLAIYKPKKYSLSWWANWVCIVFGLFLMVLSPIGGL-TIVIQAKGYKFYS-

위 예시에서는 -로 바꿨다. 

 

글자 다이렉트로 전사번역하기

시퀀스도 글자긴 한데 str과 다른 점이 있다면 상보적인 시퀀스를 만들 수 있다는 것이다. 근데 그렇다고 헐 시퀀스 안만들었다! 망했다! 이건 아니고... 

from Bio.Seq import reverse_complement, transcribe, back_transcribe, translate #질문있는데 이걸 꼭 이렇게 불러야겠니
insulin="GCATTCTGAGGCATTCTCTAACAGGTTCTCGACCCTCCGCCATGGCCCCGTGGATGCATCTCCTCACCGTGCTGGCCCTGCTGGCCCTCTGGGGACCCAACTCTGTTCAGGCCTATTCCAGCCAGCACCTGTGCGGCTCCAACCTAGTGGAGGCACTGTACATGACATGTGGACGGAGTGGCTTCTATAGACCCCACGACCGCCGAGAGCTGGAGGACCTCCAGGTGGAGCAGGCAGAACTGGGTCTGGAGGCAGGCGGCCTGCAGCCTTCGGCCCTGGAGATGATTCTGCAGAAGCGCGGCATTGTGGATCAGTGCTGTAATAACATTTGCACATTTAACCAGCTGCAGAACTACTGCAATGTCCCTTAGACACCTGCCTTGGGCCTGGCCTGCTGCTCTGCCCTGGCAACCAATAAACCCCTTGAATGAG"
#Wrap 알아서 켜주면 안되냐고...
insulin_rc=reverse_complement(insulin)
insulin_mrna=transcribe(insulin)
insulin_prot=translate(insulin)
# 순서대로 reverse complement/전사/번역 
print(insulin_rc)
print(insulin_mrna)
print(insulin_prot)

시퀀스때랑 불러오는 건 다르지만 어쨌든 된다. 

 

코돈 테이블 보기

내 블로그 아미노산 배경화면에도 있긴 한데 그걸 누가 외웁니까... 아무튼 그렇다. 

 

from Bio.Data import CodonTable
standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
print(standard_table)
from Bio.Data import CodonTable
standard_table = CodonTable.unambiguous_dna_by_id[1]
print(standard_table)

둘 다 같은 테이블을 호출하는건데, 위에껀 이름으로 부르고 밑에껀 ID로 부른다. 이름이나 ID는 위에 올린 NCBI 주소로 가면 나오니까 그걸로 부르면 된다. 

 

Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

아무튼 소환했다. 

 

색인

저거 검색도 된다. 실화다. 

 

print(standard_table.stop_codons)
['TAA', 'TAG', 'TGA']

해당 테이블의 Stop codon을 검색하거나 

 

print(mito_table.start_codons)
['ATT', 'ATC', 'ATA', 'ATG', 'GTG']

개시코돈을 검색하거나(저거 2번 테이블이다)

 

print(standard_table.forward_table['AAA'])
K

특정 코돈이 지정하는 아미노산을 찾을 수 있다. 

Lv. 35 라이츄

Lv. 35 라이츄

광고 매크로 없는 청정한 블로그를 위해 노력중입니다. 근데 나만 노력하는 것 같음… ㅡㅡ

Biopython으로 시퀀스 가져오기

Coding/Python 2022. 8. 20. 02:03

참고로 다른 코딩글은 여기다가 안 올린다. 코드블럭도 없고 카테고리도 없어서 어지간한 코딩 이야기는 노션과 워드프레스에 올리는 중... 아니면 미디움이나. 근데 이건 우째도 올렸네? 아 생물학 카테고리가 있잖아요. 

Biopython은 생물정보학에서 쓰는건데 마침 심심하던 차에 잘됐다 써봐야징 해서 어제 깔았다. 리눅스의 경우 pip install bio(안되면 biopython)으로 걍 터미널에서 깔면 된다(우분투 20.02 LTS 기준). 윈도우마냥 pip 있는 데 안 찾아가도 됨 ㄹㅇ 편함. 근데 파이참 바로가기 없는 건 너무한 거 아니냐고... 내가 터미널 열고 직접 모시러 가야것냐... 

*Notes: Biopython을 사용하려면 사용할 기능에 맞는 모듈을 모셔와야 한다. 코드에 그게 생략되는 경우가 있는데(부분삽입의 경우) from ~ import ~가 생략된 것이다. 


일단 오늘의 시범조교를 모셔보자. 

ATGGTAGCTCAAGCTCCTCATGATGATCATCAGGATGATGAGAAATTAGCAGCAGCGAGACAAAAAGAGATCGAAGATTGGTTACCAATTACTTCATCAAGAAATGCAAAGTGGTGGTACTCTGCTTTTCACAATGTCACCGCCATGGTCGGTGCCGGAGTTCTTGGACTCCCTTACGCCATGTCTCAGCTCGGATGGGGACCGGGAATTGCAGTGTTGGTTTTGTCATGGGTCATAACACTATACACATTATGGCAAATGGTGGAAATGCATGAAATGGTTCCTGGAAAGCGTTTTGATCGTTACCATGAGCTCGGACAACACGCGTTTGGAGAAAAACTCGGTCTTTATATCGTTGTGCCGCAACAATTGATCGTTGAAATCGGTGTTTGCATCGTTTATATGGTCACTGGAGGCAAATCTTTAAAGAAATTTCATGAGCTTGTTTGTGATGATTGTAAACCAATCAAGCTTACTTATTTCATCATGATCTTTGCTTCGGTTCACTTCGTCCTCTCTCATCTTCCTAATTTCAATTCCATCTCCGGCGTTTCTCTTGCTGCTGCCGTTATGTCTCTCAGCTACTCAACAATCGCATGGGCATCATCAGCAAGCAAAGGTGTTCAAGAAGACGTTCAATACGGTTACAAAGCGAAAACAACAGCCGGTACGGTTTTCAATTTCTTCAGCGGTTTAGGTGATGTGGCATTTGCTTACGCGGGTCATAATGTGGTCCTTGAGATCCAAGCAACTATCCCTTCAACGCCTGAGAAACCCTCAAAAGGTCCCATGTGGAGAGGAGTCATCGTTGCTTACATCGTCGTAGCGCTCTGTTATTTCCCCGTGGCTCTCGTTGGATACTACATTTTCGGGAACGGAGTCGAAGATAATATTCTCATGTCACTTAAGAAACCGGCGTGGTTAATCGCCACGGCGAACATCTTCGTCGTGATCCATGTCATTGGTAGTTACCAGATATATGCAATGCCGGTATTTGATATGATGGAGACTTTATTGGTCAAGAAGCTAAATTTCAGACCAACCACAACTCTTCGGTTCTTTGTCCGTAATTTCTACGTTGCTGCAACTATGTTTGTTGGTATGACGTTTCCGTTCTTCGGTGGGCTTTTGGCGTTCTTTGGTGGATTCGCGTTTGCCCCAACCACATACTTCCTCCCTTGCGTCATTTGGTTAGCCATCTACAAACCCAAGAAATACAGCTTGTCTTGGTGGGCCAACTGGGTATGCATCGTGTTTGGTCTTTTCTTGATGGTCCTATCGCCAATTGGAGGGCTAAGGACAATCGTTATTCAAGCAAAAGGATACAAGTTTTACTCATAA

At5g40780의 CDS 시퀀스이다. (코드블럭은 있는데 wrap이 안돼서 걍 생으로 붙였음)

이걸 시퀀스 오브젝트로 생성하는 방법도 간단하다. 그냥 모듈 모셔온 다음 Seq=("")로 감싸버리면 된다. 

at5g40780=Seq("ATGGTAGCTCAAGCTCCTCATGATGATCATCAGGATGATGAGAAATTAGCAGCAGCGAGACAAAAAGAGATCGAAGATTGGTTACCAATTACTTCATCAAGAAATGCAAAGTGGTGGTACTCTGCTTTTCACAATGTCACCGCCATGGTCGGTGCCGGAGTTCTTGGACTCCCTTACGCCATGTCTCAGCTCGGATGGGGACCGGGAATTGCAGTGTTGGTTTTGTCATGGGTCATAACACTATACACATTATGGCAAATGGTGGAAATGCATGAAATGGTTCCTGGAAAGCGTTTTGATCGTTACCATGAGCTCGGACAACACGCGTTTGGAGAAAAACTCGGTCTTTATATCGTTGTGCCGCAACAATTGATCGTTGAAATCGGTGTTTGCATCGTTTATATGGTCACTGGAGGCAAATCTTTAAAGAAATTTCATGAGCTTGTTTGTGATGATTGTAAACCAATCAAGCTTACTTATTTCATCATGATCTTTGCTTCGGTTCACTTCGTCCTCTCTCATCTTCCTAATTTCAATTCCATCTCCGGCGTTTCTCTTGCTGCTGCCGTTATGTCTCTCAGCTACTCAACAATCGCATGGGCATCATCAGCAAGCAAAGGTGTTCAAGAAGACGTTCAATACGGTTACAAAGCGAAAACAACAGCCGGTACGGTTTTCAATTTCTTCAGCGGTTTAGGTGATGTGGCATTTGCTTACGCGGGTCATAATGTGGTCCTTGAGATCCAAGCAACTATCCCTTCAACGCCTGAGAAACCCTCAAAAGGTCCCATGTGGAGAGGAGTCATCGTTGCTTACATCGTCGTAGCGCTCTGTTATTTCCCCGTGGCTCTCGTTGGATACTACATTTTCGGGAACGGAGTCGAAGATAATATTCTCATGTCACTTAAGAAACCGGCGTGGTTAATCGCCACGGCGAACATCTTCGTCGTGATCCATGTCATTGGTAGTTACCAGATATATGCAATGCCGGTATTTGATATGATGGAGACTTTATTGGTCAAGAAGCTAAATTTCAGACCAACCACAACTCTTCGGTTCTTTGTCCGTAATTTCTACGTTGCTGCAACTATGTTTGTTGGTATGACGTTTCCGTTCTTCGGTGGGCTTTTGGCGTTCTTTGGTGGATTCGCGTTTGCCCCAACCACATACTTCCTCCCTTGCGTCATTTGGTTAGCCATCTACAAACCCAAGAAATACAGCTTGTCTTGGTGGGCCAACTGGGTATGCATCGTGTTTGGTCTTTTCTTGATGGTCCTATCGCCAATTGGAGGGCTAAGGACAATCGTTATTCAAGCAAAAGGATACAAGTTTTACTCATAA")
print(at5g40780)

이런 식으로. 

print(type(at5g40780))
<class 'Bio.Seq.Seq'>

누가봐도 시퀀스다. 

엥 근데 저거 걍 텍스트랑 뭔 차이임? Biopython을 통해 시퀀스로 생성한 데이터는 complement sequence와 reverse complement sequence를 만들 수 있다. 즉, 시퀀스로 불러온 CDS에 상보적인 시퀀스를 명령어 한 방에 만들 수 있는 것이다. 텍스트는 저런거 안된다. 

from Bio.Seq import Seq
at5g40780=Seq("ATGGTAGCTCAAGCTCCTCATGATGATCATCAGGATGATGAGAAATTAGCAGCAGCGAGACAAAAAGAGATCGAAGATTGGTTACCAATTACTTCATCAAGAAATGCAAAGTGGT")
print("Sequence",at5g40780)
print("Complement sequence",at5g40780.complement())
print("Reverse complement sequence",at5g40780.reverse_complement())

이렇게 입력하면 

Sequence ATGGTAGCTCAAGCTCCTCATGATGATCATCAGGATGATGAGAAATTAGCAGCAGCGAGACAAAAAGAGATCGAAGATTGGTTACCAATTACTTCATCAAGAAATGCAAAGTGGT
Complement sequence TACCATCGAGTTCGAGGAGTACTACTAGTAGTCCTACTACTCTTTAATCGTCGTCGCTCTGTTTTTCTCTAGCTTCTAACCAATGGTTAATGAAGTAGTTCTTTACGTTTCACCA
Reverse complement sequence ACCACTTTGCATTTCTTGATGAAGTAATTGGTAACCAATCTTCGATCTCTTTTTGTCTCGCTGCTGCTAATTTCTCATCATCCTGATGATCATCATGAGGAGCTTGAGCTACCAT

짜자잔 

complement는 CDS 시퀀스에 그냥 상보적인거고, reverse complement는 그 상보적인 시퀀스 순서를 뒤집은 모양이다. 


아니 근데 DNA가 겁나 바이트가 길어요... 저 시범조교 CDS도 1300자가 넘는데 어느 세월에 그거 복붙할거임... 그래서 FASTA와 Genbank의 데이터를 직접 열어볼 것이다. 

FASTA 시범조교는 PDB ID 2B10, cytochrome c peroxiase이다. 

>2B10_1|Chains A, C|Cytochrome c peroxidase, mitochondrial|Saccharomyces cerevisiae (4932)
TTPLVHVASVEKGRSYEDFQKVYNAIALKLREDDEYDNYIGYGPVLVRLAWHTSGTWDKHDNTGGSYGGTYRFKKEFNDPSNAGLQNGFKFLEPIHKEFPWISSGDLFSLGGVTAVQEMQGPKIPWRCGRVDTPEDTTPDNGRLPDADKDADYVRTFFQRLNMNDREVVALMGAHALGKTHLKNSGYEGPWGAANNVFTNEFYLNLLNEDWKLEKNDANNEQWDSKSGYMMLPTDYSLIQDPKYLSIVKEYANDQDKFFKDFSKAFEKLLENGITFPKDAPSPFIFKTLEEQGL
>2B10_2|Chains B, D|Cytochrome c iso-1|Saccharomyces cerevisiae (4932)
TEFKAGSAKKGATLFKTRCLQCHTVEKGGPHKVGPNLHGIFGRHSGQAEGYSYTDANIKKNVLWDENNMSEYLTNPKKYIPGTKMASGGLKKEKDRNDLITYLKKACE

FASTA 파일 형식은 이렇게 생겼다. 일단 저 파일을 어떻게 가져올거냐... PDB 가서 2B10으로 검색하고 FASTA 파일을 받아보자. 

for seq_record in SeqIO.parse("/home/koreanraichu/rcsb_pdb_2B10.fasta","fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))
2B10_1|Chains
Seq('TTPLVHVASVEKGRSYEDFQKVYNAIALKLREDDEYDNYIGYGPVLVRLAWHTS...QGL')
294
2B10_2|Chains
Seq('TEFKAGSAKKGATLFKTRCLQCHTVEKGGPHKVGPNLHGIFGRHSGQAEGYSYT...ACE')
108

FASTA 파일은 이런 식으로 가져올 수 있다. (seqIO 데려오면 된다) 예? 경로요? 아니 이거 리눅슨디? 

사실 분량때문에 넣지는 못하지만 

print(seq_record)

만 치면 전체 파일을 볼 수 있다. 처음 하시는 분들은 전체를 보고 저 예제를 따라해보는 걸 추천한다. 


for seq_record in SeqIO.parse("/home/koreanraichu/sequence.gb","genbank"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

Genbank에서 뭐 검색하고 그거 gk(gbk)파일로 받으면 그게 genbank 파일이다. 이것도 일단 다운받아서 불러와야 한다. 웹에 있는거요? 그거 바로 못갖고온다. 

아무튼 저 코드를 실행하면 

NM_000517.6
Seq('ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGC...GCA')
576

이렇게 된다. 그리고 얘도 FASTA와 마찬가지로 

print(seq_record)

를 주면 전체 파일을 볼 수는 있는데 분량이 어유...

Lv. 35 라이츄

Lv. 35 라이츄

광고 매크로 없는 청정한 블로그를 위해 노력중입니다. 근데 나만 노력하는 것 같음… ㅡㅡ

방명록