(83)

Biopython으로 Entrez database 탐방하기

이것이 그... BLAST 만든 NCBI에 있는 데이터베이스다. 미국답게 스케일 개크다. 들어가기 전에 보통 Biopython을 쓰거나 랜덤, 넘파이, 판다스를 쓸 때는 뭘 모셔와야 하는데, Entrez에 접속하는 모듈도 마찬가지다. 근데 바이오파이썬은 그걸 떠나서 모셔오는 게 너무 핵가족 스케일이여. 아무튼... 그래서 이번에는 from Bio import Entrez 이걸 필두로 뭘 많이 모셔올 예정인데... 아니 아직 아냐 마저 듣고 가. Entrez에 접속해서 뭘 하려면 저거 말고 필수적으로 입력해야 하는 게 있다. 1. 너님의 API 키 2. 너님의 메일 주소 3. 너님의 매개 변수 셋 중 하나는 반드시 입력해야 하고, 여기서는 이메일을 입력할건데 저거 뭐 이메일 제출한다고 CIA에서 당신 털러..

Biopython으로 BLAST 돌려보기

쿡북에 있는 것 중 일부 생략했다. (BLAT이랑 8.3~8.5부분) BLAST는 Basic local alignment search tool의 준말로, 보통 블래스트라고 한다. 지나가는 2~30대를 붙잡고 크흐~대 기억이~ 하면 지~난 사랑이~ 가 나오듯, 지나가던 생물학도를 붙잡고 블래스트? 하면 NCBI! 하는 정도로 국민 툴이고 전공수업을 듣다 보면 한 번은 쓰고 넘어가는 툴이기도 하다. 애초에 전공이 생물학인지를 먼저 물어봐야 하는 거 아니냐 이 글에 있는 게 어렵다… 그러면 조용히 구글을 열고 BLAST 검색해서 웹사이트 들어가자. 거기서 할 수 있다. NCBIWWW www가 우리말로 ㅋㅋㅋ이긴 한데 그건 일본에서 통하는 말이고... 보통은 월드 와이드 웹이다. BLAST를 따로 깔지 않고도 ..

번외편-코딩테스트 풀이

코테 본다는 얘기는 못 들은 것 같고... 그냥 면접보다가 어디 실력 좀 볼까? 해서 나온거긴 한데... IDE도 셋업 안 된 상태에서 jupyter 데모로 봤었음... 근데 Python을 안 쓰면 나한테 연락을 안 할것인데 이건 뭔 상황인지 모르겠고... 거기다가 실습할 때 썼던 파일들 올릴랬는데 깃헙 뻑나서 오전중에는 그거랑 씨름함... 해결은 봤죠. 참고로 어제 사용한 FASTAQ파일과 오늘 사용한 FASTAQ파일은 다릅니다. 이거 구하기도 개빡셈. 모듈 모셔오기 from Bio import SeqIO from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord import numpy as np import pandas as pd 세번째줄 안불러도 됨 ..

Biopython으로 MSA 해보기

MSA: multiple sequence alignment 여기에 관한 이론적인 설명은 나중에 또 입 털어드림 ㅇㅇ 아 참고로 MSA 관련해서 다른건 다 결과가 제대로 나왔는데 툴 관련해서 결과가 안나왔어요 이게 암만 찾아도 답이 없어서 좀 더 알아보고 수정할 가능성이 있음 시범조교 앞으로 오늘은 시범조교 가짓수가 좀 많은데 그 중에서도 FASTA 파일들을 좀 올리고자 한다. 이거 말고도 pfam에서 두갠가 받았는데 그건 걍 가서 암거나 받으면 된다. 해당 파일은 박테리아의 16s rRNA 시퀀스가 들어있는 파일이다. FASTA 파일이라 메모장 있으면 일단 열 수는 있다. 리눅스에서는 gedit으로 만들고 편집하고 다 했다. (vim 안씀) rRNA 시퀀스는 Silva에서 가져왔다. 고마워요 실바! Ag..

Biopython SeqIO 써보기

파이참 쓰다보니 키보드 먹통돼서 재부팅 여러번 했다. 이것도 갈 때 된 듯... *참고로 그냥 리드나 파싱으로 긁어오는 방법은 첫 글에서 썼으므로 생략한다. Iteration 활용하기 근데 들어도 이건 뭔 기능인가 싶긴 함. from Bio import SeqIO identifiers = [seq_record.id for seq_record in SeqIO.parse("/home/koreanraichu/sequence.gb", "genbank")] print(identifiers) 뭐야 wrap 넣어줘요... 레코드와 for문을 조합해서 원하는 데이터(ID나 시퀀스, description)만 가져올 수 있다. from Bio import SeqIO identifiers = [seq_record.id fo..

Biopython으로 시퀀스 레코드 생성하고 만져보기

오늘껀 근데 하면서 나도 이해 못했음... SeqRecord를 만들려면 뭘 또 모셔와야 하는데 바로 from Bio.SeqRecord import SeqRecord 이놈이다. 이정도면 복잡한 거 돌리려면 아주 모셔오는데만 열댓줄 쓰게 생겼는데??? 텐서플로우나 저기 그 넘파이 판다스처럼 합쳐버리지 그러냐... (걔네는 일단 한 번 데려오면 다 써먹을 수 있음) 레코드 생성하기 이모지 넣고싶은데 이거 리눅스라 이모지 넣는 법을 모름... 아무튼 레코드는 이렇게 생성하면 된다. from Bio.SeqRecord import SeqRecord from Bio.Seq import Seq EcoRI=Seq("GAATTC") # ECoRI sequence ECoRI_r=SeqRecord(EcoRI) #로 레코드를 ..

Biopython으로 시퀀스 다뤄보기

전에요? 아니 그건 가져온거고. 시퀀스도 텍스트나 리스트처럼 인덱싱과 슬라이싱이 된다. 방법도 똑같다. 그래서 그건 생략할 예정임... 그리고 그거 아니어도 분량이 많아요 또. .count() 특정 문자열이 몇 개인지 세 준다. 뭐 시퀀스에서 아데닌이나 구아닌 갯수 세주고 그런거다. 이걸 이용하면 특정 단백질 시퀀스에서 특정 아미노산(카테고리면 겁나 세야된다...)이 차지하는 비율도 볼 수 있다. 쉬어가는 코너-Primer에서 GC함량 구하기 아 이거 중요합니다. Primer 만들 때 중요한 요소 중 하나가 GC 함량이다. 참고로 여기서는 두 가지 방법으로 구해볼건데 첫번째는 .count()를 이용해 G와 C의 수를 세서 전체 DNA 염기 수로 나누는 거고, 두번째는 바이오파이썬의 모듈을 이용하는 방법이..

Biopython으로 시퀀스 가져오기

참고로 다른 코딩글은 여기다가 안 올린다. 코드블럭도 없고 카테고리도 없어서 어지간한 코딩 이야기는 노션과 워드프레스에 올리는 중... 아니면 미디움이나. 근데 이건 우째도 올렸네? 아 생물학 카테고리가 있잖아요. Biopython은 생물정보학에서 쓰는건데 마침 심심하던 차에 잘됐다 써봐야징 해서 어제 깔았다. 리눅스의 경우 pip install bio(안되면 biopython)으로 걍 터미널에서 깔면 된다(우분투 20.02 LTS 기준). 윈도우마냥 pip 있는 데 안 찾아가도 됨 ㄹㅇ 편함. 근데 파이참 바로가기 없는 건 너무한 거 아니냐고... 내가 터미널 열고 직접 모시러 가야것냐... *Notes: Biopython을 사용하려면 사용할 기능에 맞는 모듈을 모셔와야 한다. 코드에 그게 생략되는 경..

Biopython으로 Entrez database 탐방하기

Coding/Python 2022. 8. 21. 00:05

이것이 그... BLAST 만든 NCBI에 있는 데이터베이스다. 미국답게 스케일 개크다. 


들어가기 전에

보통 Biopython을 쓰거나 랜덤, 넘파이, 판다스를 쓸 때는 뭘 모셔와야 하는데, Entrez에 접속하는 모듈도 마찬가지다. 근데 바이오파이썬은 그걸 떠나서 모셔오는 게 너무 핵가족 스케일이여. 아무튼... 그래서 이번에는 

from Bio import Entrez

이걸 필두로 뭘 많이 모셔올 예정인데... 아니 아직 아냐 마저 듣고 가. 

Entrez에 접속해서 뭘 하려면 저거 말고 필수적으로 입력해야 하는 게 있다. 
1. 너님의 API 키 
2. 너님의 메일 주소
3. 너님의 매개 변수


셋 중 하나는 반드시 입력해야 하고, 여기서는 이메일을 입력할건데 저거 뭐 이메일 제출한다고 CIA에서 당신 털러 오는 거 아니니까 안심하고 쓰자. 오히려 저거 안 쓰면 에러난다. 

 

Einfo

Entrez 자체에 대한 정보를 볼 수 있다. 

from Bio import Entrez
Entrez.email = "blackholekun@gmail.com" # 내가 누구인지 말해주는 과정이 필요하다고...
# 자기 이메일 쓰면 됩니다
handle = Entrez.einfo()
result = handle.read()
handle.close()
print(result)
b'<?xml version="1.0" encoding="UTF-8" ?>\n<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD einfo 20190110//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20190110/einfo.dtd">\n<eInfoResult>\n<DbList>\n\n\t<DbName>pubmed</DbName>\n\t<DbName>protein</DbName>\n\t<DbName>nuccore</DbName>\n\t<DbName>ipg</DbName>\n\t<DbName>nucleotide</DbName>\n\t<DbName>structure</DbName>\n\t<DbName>genome</DbName>\n\t<DbName>annotinfo</DbName>\n\t<DbName>assembly</DbName>\n\t<DbName>bioproject</DbName>\n\t<DbName>biosample</DbName>\n\t<DbName>blastdbinfo</DbName>\n\t<DbName>books</DbName>\n\t<DbName>cdd</DbName>\n\t<DbName>clinvar</DbName>\n\t<DbName>gap</DbName>\n\t<DbName>gapplus</DbName>\n\t<DbName>grasp</DbName>\n\t<DbName>dbvar</DbName>\n\t<DbName>gene</DbName>\n\t<DbName>gds</DbName>\n\t<DbName>geoprofiles</DbName>\n\t<DbName>homologene</DbName>\n\t<DbName>medgen</DbName>\n\t<DbName>mesh</DbName>\n\t<DbName>ncbisearch</DbName>\n\t<DbName>nlmcatalog</DbName>\n\t<DbName>omim</DbName>\n\t<DbName>orgtrack</DbName>\n\t<DbName>pmc</DbName>\n\t<DbName>popset</DbName>\n\t<DbName>proteinclusters</DbName>\n\t<DbName>pcassay</DbName>\n\t<DbName>protfam</DbName>\n\t<DbName>biosystems</DbName>\n\t<DbName>pccompound</DbName>\n\t<DbName>pcsubstance</DbName>\n\t<DbName>seqannot</DbName>\n\t<DbName>snp</DbName>\n\t<DbName>sra</DbName>\n\t<DbName>taxonomy</DbName>\n\t<DbName>biocollections</DbName>\n\t<DbName>gtr</DbName>\n</DbList>\n\n</eInfoResult>\n'

...... 아 개비스콘 아저씨 마렵네 이거. 

 

그래서 짤줍했음. 

아니 좀 깔끔하게 못 보여줘요? 

from Bio import Entrez
Entrez.email = "blackholekun@gmail.com" # 내가 누구인지 말해주는 과정이 필요하다고...
# 이메일은 자기꺼 그냥 쓰세요
handle = Entrez.einfo()
record = Entrez.read(handle)
handle.close()
print(record)
{'DbList': ['pubmed', 'protein', 'nuccore', 'ipg', 'nucleotide', 'structure', 'genome', 'annotinfo', 'assembly', 'bioproject', 'biosample', 'blastdbinfo', 'books', 'cdd', 'clinvar', 'gap', 'gapplus', 'grasp', 'dbvar', 'gene', 'gds', 'geoprofiles', 'homologene', 'medgen', 'mesh', 'ncbisearch', 'nlmcatalog', 'omim', 'orgtrack', 'pmc', 'popset', 'proteinclusters', 'pcassay', 'protfam', 'biosystems', 'pccompound', 'pcsubstance', 'seqannot', 'snp', 'sra', 'taxonomy', 'biocollections', 'gtr']}

이러시는 이유가 있으실 것 아니예요... 하씨 넘파이 마렵네 이거... 

 

Esearch

이제 본격적으로 찾아볼 시간이다. 핫핫 데이터내놔! 

 

PubMed에서 논문 찾기

from Bio import Entrez
Entrez.email = "blackholekun@gmail.com"
# 아 맞다 메일
handle = Entrez.esearch(db="pubmed", term="biopython[title]", retmax="40" )
record = Entrez.read(handle)
print(record['IdList'])
['34434786', '22909249', '19304878']

바이오파이썬으로 찾았다. (제목에 바이오파이썬이 있는 것) 근데 이 예제 노잼임... 시범조교 불러올거임... 

 

은 바로 이녀석. C. kimchii다. (구 L. kimchii)

김치 related에 당당히 이름을 올렸다. 아니 그야 김치에서 나왔으니까… (자매품으로는 Leuconostoc kimchii도 있다) 근데 얘들아 두부김치도 related 아니냐... 

 

from Bio import Entrez
Entrez.email = "blackholekun@gmail.com"
# 아 맞다 메일
handle = Entrez.esearch(db="pubmed", term="kimchii[title]", retmax="40" )
record = Entrez.read(handle)
print(record['IdList'])
['34320438', '31651376', '31270190', '30361979', '30172442', '29214493', '28920843', '27572507', '27002961', '26370793', '25425317', '25332883', '22140166', '21914872', '21221947', '20494991', '16055277', '15023961', '11931163', '11034505', '11034488']

(대충 개비스콘 아저씨 위장 부여잡는 짤)

 

from Bio import Entrez
import numpy as np
Entrez.email = "blackholekun@gmail.com"
# 아 맞다 메일
handle = Entrez.esearch(db="pubmed", term="kimchii[title]", retmax="40" )
record = Entrez.read(handle)
result=np.array(record['IdList'])
result=result.reshape(7,3)
print(result)
[['34320438' '31651376' '31270190']
 ['30361979' '30172442' '29214493']
 ['28920843' '27572507' '27002961']
 ['26370793' '25425317' '25332883']
 ['22140166' '21914872' '21221947']
 ['20494991' '16055277' '15023961']
 ['11931163' '11034505' '11034488']]

아 편안하네. 

 

from Bio import Entrez
import numpy as np
Entrez.email = "blackholekun@gmail.com"
# 아 맞다 메일
handle = Entrez.esearch(db="pubmed", term="kimchii[title] and Leuconostoc[title]", retmax="40" )
record = Entrez.read(handle)
# 출력 안이뻐서 배열 만들었음
result=np.array(record['IdList'])
result=result.reshape(2,4)
print(result)
# 바람직한 출력이야!
[['34320438' '31270190' '30172442' '25425317']
 ['25332883' '21914872' '20494991' '11034505']]

연산자도 먹힌다. 위 코드는 Leuconostoc kimchii를 찾는 연산자. 미역김치 있나요 그건 저거 다 찾아보면 나올듯 

 

Nucleotide DB 찾기

from Bio import Entrez
Entrez.email = "blackholekun@gmail.com"
# 아 맞다 메일
handle = Entrez.esearch(db="nucleotide", term="Arabidopsis[Orgn] AND LHT1[Gene]", idtype="acc")
record = Entrez.read(handle)
print("%s founded" % record["Count"])
print(record["IdList"])
5 founded
['NM_001344354.1', 'NM_180778.4', 'NM_001344353.1', 'NC_003076.8', 'CP002688.1']

애기장대(Arabodopsis)의 LHT1에 대해 찾으면 다섯 개가 나온다. 아까 걔는 없음.. ㅋㅋㅋㅋㅋㅋㅋㅋ 

 

저널 찾기

from Bio import Entrez
Entrez.email = "blackholekun@gmail.com"
# 아 맞다 메일
handle = Entrez.esearch(db="nlmcatalog", term="computational[Journal]", retmax="20")
record = Entrez.read(handle)
print("{} computational journals found".format(record["Count"]))
print("The first 20 are\n{}".format(record["IdList"]))
187 computational journals found
The first 20 are
['101775476', '101775136', '468912', '468839', '467370', '101775780', '101774751', '466063', '101768752', '101765300', '101759185', '101752828', '464655', '101768811', '101755127', '101753951', '101753371', '101740904', '101737789', '101736625']

저널? 그럼 이거 인공지능 저널 이런것도 찾아주나? 

 

Epost

있으니까 보고 가긴 하는데 이거 뭔지는 모르겠음... 

 

from Bio import Entrez
Entrez.email = "blackholekun@gmail.com"
# 아 맞다 메일
id_list = ["19304878", "18606172", "16403221", "16377612", "14871861", "14630660"]
print(Entrez.epost("pubmed", id=",".join(id_list)).read())
b'<?xml version="1.0" encoding="UTF-8" ?>\n<!DOCTYPE ePostResult PUBLIC "-//NLM//DTD epost 20090526//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20090526/epost.dtd"><ePostResult>\n\t<QueryKey>1</QueryKey>\n\t<WebEnv>MCID_618c9fbf207466500268a6a2</WebEnv>\n</ePostResult>\n'

(대충 개비스콘 아저씨 속쓰린 짤)

 

Esummary

검색결과의 첫 번째 데이터에 대해 요약본을 보여준다. 

 

from Bio import Entrez
Entrez.email = "blackholekun@gmail.com"
# 아 맞다 메일
handle = Entrez.esummary(db="pubmed", id="31651376")
record = Entrez.read(handle)
info = record[0]
print("Journal info\nid: {}\nTitle: {}".format(record[0]["Id"], info["Title"]))
Journal info
id: 31651376
Title: <i>Lactococcus kimchii</i> sp. nov., a new lactic acid bacterium isolated from kimchi.

 

Efetch

이거 봤으면 엔트레즈 다 본거다. ...그렇다고 보다말고 가지는 말고... 

 

from Bio import Entrez
Entrez.email = "blackholekun@gmail.com"
# 아 맞다 메일
handle = Entrez.efetch(db="nucleotide", id="EU490707", rettype="gb", retmode="text")
print(handle.read())
LOCUS       EU490707                1302 bp    DNA     linear   PLN 26-JUL-2016
DEFINITION  Selenipedium aequinoctiale maturase K (matK) gene, partial cds;
            chloroplast.
ACCESSION   EU490707
VERSION     EU490707.1
KEYWORDS    .
SOURCE      chloroplast Selenipedium aequinoctiale
  ORGANISM  Selenipedium aequinoctiale
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliopsida; Liliopsida; Asparagales; Orchidaceae;
            Cypripedioideae; Selenipedium.
REFERENCE   1  (bases 1 to 1302)
  AUTHORS   Neubig,K.M., Whitten,W.M., Carlsward,B.S., Blanco,M.A., Endara,L.,
            Williams,N.H. and Moore,M.
  TITLE     Phylogenetic utility of ycf1 in orchids: a plastid gene more
            variable than matK
  JOURNAL   Plant Syst. Evol. 277 (1-2), 75-84 (2009)
REFERENCE   2  (bases 1 to 1302)
  AUTHORS   Neubig,K.M., Whitten,W.M., Carlsward,B.S., Blanco,M.A.,
            Endara,C.L., Williams,N.H. and Moore,M.J.
  TITLE     Direct Submission
  JOURNAL   Submitted (14-FEB-2008) Department of Botany, University of
            Florida, 220 Bartram Hall, Gainesville, FL 32611-8526, USA
FEATURES             Location/Qualifiers
     source          1..1302
                     /organism="Selenipedium aequinoctiale"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /specimen_voucher="FLAS:Blanco 2475"
                     /db_xref="taxon:256374"
     gene            <1..>1302
                     /gene="matK"
     CDS             <1..>1302
                     /gene="matK"
                     /codon_start=1
                     /transl_table=11
                     /product="maturase K"
                     /protein_id="ACC99456.1"
                     /translation="IFYEPVEIFGYDNKSSLVLVKRLITRMYQQNFLISSVNDSNQKG
                     FWGHKHFFSSHFSSQMVSEGFGVILEIPFSSQLVSSLEEKKIPKYQNLRSIHSIFPFL
                     EDKFLHLNYVSDLLIPHPIHLEILVQILQCRIKDVPSLHLLRLLFHEYHNLNSLITSK
                     KFIYAFSKRKKRFLWLLYNSYVYECEYLFQFLRKQSSYLRSTSSGVFLERTHLYVKIE
                     HLLVVCCNSFQRILCFLKDPFMHYVRYQGKAILASKGTLILMKKWKFHLVNFWQSYFH
                     FWSQPYRIHIKQLSNYSFSFLGYFSSVLENHLVVRNQMLENSFIINLLTKKFDTIAPV
                     ISLIGSLSKAQFCTVLGHPISKPIWTDFSDSDILDRFCRICRNLCRYHSGSSKKQVLY
                     RIKYILRLSCARTLARKHKSTVRTFMRRLGSGLLEEFFMEEE"
ORIGIN      
        1 attttttacg aacctgtgga aatttttggt tatgacaata aatctagttt agtacttgtg
       61 aaacgtttaa ttactcgaat gtatcaacag aattttttga tttcttcggt taatgattct
      121 aaccaaaaag gattttgggg gcacaagcat tttttttctt ctcatttttc ttctcaaatg
      181 gtatcagaag gttttggagt cattctggaa attccattct cgtcgcaatt agtatcttct
      241 cttgaagaaa aaaaaatacc aaaatatcag aatttacgat ctattcattc aatatttccc
      301 tttttagaag acaaattttt acatttgaat tatgtgtcag atctactaat accccatccc
      361 atccatctgg aaatcttggt tcaaatcctt caatgccgga tcaaggatgt tccttctttg
      421 catttattgc gattgctttt ccacgaatat cataatttga atagtctcat tacttcaaag
      481 aaattcattt acgccttttc aaaaagaaag aaaagattcc tttggttact atataattct
      541 tatgtatatg aatgcgaata tctattccag tttcttcgta aacagtcttc ttatttacga
      601 tcaacatctt ctggagtctt tcttgagcga acacatttat atgtaaaaat agaacatctt
      661 ctagtagtgt gttgtaattc ttttcagagg atcctatgct ttctcaagga tcctttcatg
      721 cattatgttc gatatcaagg aaaagcaatt ctggcttcaa agggaactct tattctgatg
      781 aagaaatgga aatttcatct tgtgaatttt tggcaatctt attttcactt ttggtctcaa
      841 ccgtatagga ttcatataaa gcaattatcc aactattcct tctcttttct ggggtatttt
      901 tcaagtgtac tagaaaatca tttggtagta agaaatcaaa tgctagagaa ttcatttata
      961 ataaatcttc tgactaagaa attcgatacc atagccccag ttatttctct tattggatca
     1021 ttgtcgaaag ctcaattttg tactgtattg ggtcatccta ttagtaaacc gatctggacc
     1081 gatttctcgg attctgatat tcttgatcga ttttgccgga tatgtagaaa tctttgtcgt
     1141 tatcacagcg gatcctcaaa aaaacaggtt ttgtatcgta taaaatatat acttcgactt
     1201 tcgtgtgcta gaactttggc acggaaacat aaaagtacag tacgcacttt tatgcgaaga
     1261 ttaggttcgg gattattaga agaattcttt atggaagaag aa
//

어... 뭔진 모르겠지만 일단 제가 잘못했어요? 

 

긁어와서 레코드 만들기

from Bio import SeqIO
from Bio import Entrez
Entrez.email = "blackholekun@gmail.com"
# 아 맞다 메일
handle = Entrez.efetch(db="nucleotide", id="NM_180778.4", rettype="gb", retmode="text")
record = SeqIO.read(handle, "genbank") # 레코드 형식으로 불러와서 부분적으로 취사선택할 수 있다.
handle.close() # 이거 꼭 닫아야됨?
print(record.description)
Arabidopsis thaliana lysine histidine transporter 1 (LHT1), mRNA

이런 식으로 취사선택해서 보는 것도 된다. 

 

아 긁어왔으면 저장해야지 

참고로 이거 저장은 Entrez 모듈 단독으로 못하고 SeqIO 불러야된다. 

 

from Bio import Entrez
from Bio import SeqIO
Entrez.email = "blackholekun@gmail.com"
# 아 맞다 메일
handle = Entrez.efetch(db="nucleotide", id="NM_180778.4", rettype="gb", retmode="text")
record=SeqIO.read(handle,"genbank")
handle.close()
print(record)
SeqIO.write(record,"/home/koreanraichu/NM_180778.4.fasta","fasta")
# 사실 여기까진 쿡북에 없었는데 저장할 수 있는 방법이 없나 해서 해봤음.

사실 genbank 포맷인데 저걸 못열어서 fasta로 저장함... 

 

Elink

뭐 관련된 걸 찾아준다고 한다. DNA 시퀀스로 유전자 찾아주듯이... 

 

from Bio import Entrez
Entrez.email = "blackholekun@gmail.com"
# 아 맞다 메일
pmid = "19304878"
record = Entrez.read(Entrez.elink(dbfrom="pubmed", id=pmid))
print(record[0]['IdList'])
['19304878']

이런 식으로 쓰고 

for linksetdb in record[0]["LinkSetDb"]:
    print(linksetdb["DbTo"], linksetdb["LinkName"], len(linksetdb["Link"]))
pubmed nuccore_pubmed 1
pubmed nuccore_pubmed_refseq 9
pubmed nuccore_pubmed_weighted 9

for문 먹여도 된단다. 

 

EGQuery

뭐야 쿼리가 왜 여기서 나와요? 그러게요. 

 

from Bio import Entrez
Entrez.email = "blackholekun@gmail.com"
# 아 맞다 메일
handle = Entrez.egquery(term="kimchii")
record = Entrez.read(handle)
for row in record["eGQueryResult"]:
    print(row["DbName"], row["Count"])
pubmed 50
pmc 154
mesh 5
books 0
pubmedhealth Error
omim 0
ncbisearch 4
nuccore 928
nucgss 0
nucest 0
protein 33512
genome 2
structure 0
taxonomy 0
snp 0
dbvar 0
gene 10109
sra 3
biosystems 295
unigene 0
cdd 0
clone 0
popset 25
geoprofiles 0
gds 0
homologene 0
pccompound 0
pcsubstance 0
pcassay 0
nlmcatalog 0
probe 0
gap 0
proteinclusters 0
bioproject 11
biosample 15
biocollections 0

그냥 얘네 DB에 얼마나 있는지가 나온다. 워우. 

 

ESpell

엥? 스펠? 마법사 찍고 올까요? 아니 그거 아냐... 물론 난 작년에 전직했지만 아무튼 그거 아님. 여기서 스펠은 스펠링(철자)의 스펠이지 마법 스펠이 아니다. 

 

from Bio import Entrez
Entrez.email = "blackholekun@gmail.com"
# 아 맞다 메일
handle = Entrez.espell(term="Conpanilactobacillus")
record = Entrez.read(handle)
print(record["Query"])
print(record["CorrectedQuery"])
Conpanilactobacillus
companilactobacillus

위: 오타 교정 전
아래: 오타 교정 후


사용예

개발자가 또 올려놔서 해봤음 

 

Pubmed와 Medline의 대환장 콜라보레이션

일단 최종 코드가 두 개다. 

 

from Bio import Entrez
from Bio import Medline
import numpy as np
Entrez.email = "blackholekun@gmail.com"
# 아 맞다 메일
handle = Entrez.egquery(term="kimchii")
record = Entrez.read(handle)
for row in record["eGQueryResult"]:
    if row["DbName"] == "pubmed":
        print(row["Count"])
# 쿼리를 통해 kimchii가 들어가는 걸 데려온 다음, pubmed에 있는 것만 뽑았다.
handle = Entrez.esearch(db="pubmed", term="kimchii", retmax=50)
record = Entrez.read(handle)
handle.close()
idlist = record["IdList"]
# 이제 리스트가 많으면 뽑아서 확인할 때 알아서 numpy를 불러봅니다.
handle = Entrez.efetch(db="pubmed", id=idlist, rettype="medline",retmode="text")
records = Medline.parse(handle)
# 이렇게 된 이상 medline으로 간다!

handle = Entrez.efetch(db="pubmed", id=idlist, rettype="medline",retmode="text")
records = Medline.parse(handle)
for record in records:
    print("title:", record.get("TI", "?"))
    print("authors:", record.get("AU", "?"))
    print("source:", record.get("SO", "?"))
    print("")
# 전체 출력은 여기서(전체 결과 중 제목, 저자, source를 출력)
from Bio import Entrez
from Bio import Medline
import numpy as np
Entrez.email = "blackholekun@gmail.com"
# 아 맞다 메일
handle = Entrez.egquery(term="kimchii")
record = Entrez.read(handle)
for row in record["eGQueryResult"]:
    if row["DbName"] == "pubmed":
        print(row["Count"])
# 쿼리를 통해 kimchii가 들어가는 걸 데려온 다음, pubmed에 있는 것만 뽑았다.
handle = Entrez.esearch(db="pubmed", term="kimchii", retmax=50)
record = Entrez.read(handle)
handle.close()
idlist = record["IdList"]
# 이제 리스트가 많으면 뽑아서 확인할 때 알아서 numpy를 불러봅니다.
handle = Entrez.efetch(db="pubmed", id=idlist, rettype="medline",retmode="text")
records = Medline.parse(handle)
# 이렇게 된 이상 medline으로 간다!

search_title = "Leuconostoc"
for record in records:
    if not "TI" in record:
        continue
    if search_title in record["TI"]:
        print("Keyword %s found: %s" % (search_title, record["TI"]))
# 제목에 Leuconostoc이 있는 것만 뽑아달라!

위 코드와 아래 코드의 공통 분기는 다음과 같다. 

1. 쿼리로 kimchii 찾아서 
2. pubmed에 있는 것만 뽑아서 
3. ID 리스트업해서 
4. 어디가 찾아야지 

그리고 갈라지는 분기는 여기서 전체 목록을 출력하는가 or Leuconostoc으로 제목을 한번 더 필터링하는가 여부. 저거 둘 다 쓰면 안먹혀서 분기 갈라진거다... 

 

Nucleotide 코드 받기

from Bio import Entrez
Entrez.email = "blackholekun@gmail.com"
# 아 맞다 메일
handle = Entrez.egquery(term="Arabidopsis")
record = Entrez.read(handle)
for row in record["eGQueryResult"]:
    if row["DbName"] == "nuccore":
        print(row["Count"])
# 애기장대로 긁어와서 nuccore에 있는 것만 출력해라
handle = Entrez.esearch(db="nucleotide", term="Arabidopsis", retmax=800, idtype="acc")
record = Entrez.read(handle)
handle.close()
# 올해 얼마나 나왔는지는 모르겠고 일단 800개만 뽑아보자
print(record["IdList"][:5])
idlist = ",".join(record["IdList"][:5])
# 레코드에서 다섯개 뽑아서 리스트업한다.
handle = Entrez.efetch(db="nucleotide", id=idlist, retmode="xml")
records = Entrez.read(handle)
print(records[0].keys())
# 0번째 레코드 키좀 주세요
print(records[0]["GBSeq_length"])
# 시퀀스 길이 줘봐봐

얘도 쿼리 긁는 건 똑같다. 찾는 DB가 다르지... 그리고 예제 코드와 달리 데이터가 방대해서 어쩔 수 없이 위에서 5개만 뽑았다. 안그러면 안끝나 이거... 

 

순순히 Genbank 레코드를 내놓는다면 유혈사태는 면할것이다 

from Bio import Entrez
from Bio import SeqIO
Entrez.email = "blackholekun@gmail.com"
# 아 맞다 메일
handle = Entrez.egquery(term="Leuconostoc AND kimchii") # 뭐야 연산자 이렇게 쓰면 됨?
record = Entrez.read(handle)
for row in record["eGQueryResult"]:
    if row["DbName"] == "nuccore":
        print(row["Count"])
# nuccore에 수록되어있는 L.kimchii에 대해 찾아보자. (Leuconostoc kimchii)
handle = Entrez.esearch(db="nuccore", term="Leuconostoc AND kimchii")
record = Entrez.read(handle)
gi_list = record["IdList"]
# GI list를 만들어서 출력해보자
gi_str = ",".join(gi_list[0:5])
handle = Entrez.efetch(db="nuccore", id=gi_str, rettype="gb", retmode="text")
text = handle.read()
print(text)
# 이거 저장 안됨?
handle = Entrez.efetch(db="nuccore", id=gi_str, rettype="gb", retmode="text")
records = SeqIO.parse(handle, "gb")
for record in records:
    print("%s, length %i, with %i features" % (record.name, len(record), len(record.features)))
# Parsing에는 for문이 국룰이다.

제목 저거 괜찮냐 이쯤되면 parsing에는 for문이 국룰 아니냐... 아무튼. ...아니 잠깐만 쿼리 연산자 저렇게 쓰는거임? 

아무튼 그렇다. 

 

Taxonomy lineage

from Bio import Entrez
Entrez.email = "blackholekun@gmail.com"
# 아 맞다 메일!!!
handle = Entrez.esearch(db="Taxonomy", term="Lactobacillus")
record = Entrez.read(handle)
print(record['IdList'])
# Taxonomy에서 lactobacillus를 찾아 ID를 내놓아라!
handle = Entrez.efetch(db="Taxonomy", id="1578", retmode="xml")
records = Entrez.read(handle)
print(records[0]['Lineage'])
# Lineage 주세요!
from Bio import Entrez
Entrez.email = "blackholekun@gmail.com"
# 아 맞다 메일!!!
handle = Entrez.esearch(db="Taxonomy", term="Enterococcus")
record = Entrez.read(handle)
print(record['IdList'])
# Taxonomy에서 Enterococcus를 찾아 ID를 내놓아라!
handle = Entrez.efetch(db="Taxonomy", id=record['IdList'], retmode="xml")
records = Entrez.read(handle)
print(records[0]['Lineage'])
# Lineage 주세요!

위 코드와 아래 코드 둘 다 일단 똑같은데, 아래 코드는 taxonomy 번호 결과를 아예 변수로 떄려박아서 다이렉트로 찾아준다. (위 코드는 직접 입력해야 함) 

Lv. 35 라이츄

Lv. 35 라이츄

광고 매크로 없는 청정한 블로그를 위해 노력중입니다. 근데 나만 노력하는 것 같음… ㅡㅡ

Biopython으로 BLAST 돌려보기

Coding/Python 2022. 8. 20. 23:47

쿡북에 있는 것 중 일부 생략했다. (BLAT이랑 8.3~8.5부분) 


BLAST는 Basic local alignment search tool의 준말로, 보통 블래스트라고 한다. 지나가는 2~30대를 붙잡고 크흐~대 기억이~ 하면 지~난 사랑이~ 가 나오듯, 지나가던 생물학도를 붙잡고 블래스트? 하면 NCBI! 하는 정도로 국민 툴이고 전공수업을 듣다 보면 한 번은 쓰고 넘어가는 툴이기도 하다. 애초에 전공이 생물학인지를 먼저 물어봐야 하는 거 아니냐 

이 글에 있는 게 어렵다… 그러면 조용히 구글을 열고 BLAST 검색해서 웹사이트 들어가자. 거기서 할 수 있다. 


NCBIWWW

www가 우리말로 ㅋㅋㅋ이긴 한데 그건 일본에서 통하는 말이고... 보통은 월드 와이드 웹이다. BLAST를 따로 깔지 않고도 돌리려면 저게 있어야 한다. 물론 BLAST 깔려있는 서버가 있으면 저거 쓰지 말고 걍 로컬 돌리자. 저거 생각보다 느리다. 어느정도로 느리냐면 돌려놓고 강의실 있는 건물에 있는 카페에서 커피 한 잔 마시고 오면 끝난다. 돌려놓고 설거지 한번 하면 된다 

from Bio.Blast import NCBIWWW

아무튼 오늘은 이 친구를 필두로 뭘 또 많이 데려올 예정이다. 

 

인터넷으로 BLAST 돌리기-기본편

참고로 로컬로 돌리는 건 생략. 놋북에 그게 안 깔려있다. 

 

from Bio.Blast import NCBIWWW
result_handle = NCBIWWW.qblast("blastn", "nt", "8332116")
print(result_handle)

기본 코드는 이거다. 이걸 실행하면 더럽게 오래 걸리면서... 

<_io.StringIO object at 0x7f2f672ff280>

이러시는 이유가 있으실 것 아니예요. 

쟤가 왜 저러는지는 이따 저장할 때 설명하기로 하고, 저 세 개가 뭔지에 대해 알아보자. Biopython에서 BLAST를 돌릴 때는 qblast()를 쓰는데, 거기서 필요한 3대 요소가 뭘로 돌릴래/데이터베이스 어디서 찾을래/GI 번호이다. GI 번호는 

이런 식으로 Genbank에서 뭔가를 찾으면 나오는 번호. 즉 저 코드는 GI 번호가 8332116인 유전자의 시퀀스로 blastn을 돌리라는 얘기다. (DB 저거 약어같은데 나도 뭔지 모름) 혹시 여기에 대해 더 궁금한 점이 있다면 

HowTo_BLASTGuide.pdf
0.89MB

한번 읽어보자. NCBI에서 제공하는 가이드이다. (물론 영어다) 

 

인터넷으로 BLAST 돌리기-FASTA 파일로 돌리기

Genbank도 돌릴 수는 있는데 뭐가 문젠지 결과를 조졌다... 결과 파일에 암것도 없음. 

 

from Bio.Blast import NCBIWWW
fasta_string = open("/home/koreanraichu/CaMV.fasta").read()
result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)
print(result_handle)
from Bio import SeqIO
from Bio.Blast import NCBIWWW
record = SeqIO.read("/home/koreanraichu/CaMV.fasta", format="fasta")
result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)
print(result_handle)
from Bio import SeqIO
from Bio.Blast import NCBIWWW
record = SeqIO.read("m_cold.fasta", format="fasta")
result_handle = NCBIWWW.qblast("blastn", "nt", record.format("fasta"))
print(result_handle)

셋 다 쓸 수 있다. Parse는... 그거 쓸거면 for로 불러와서 시퀀스 픽한다음 몇번쨰 시퀀스로 찾을지 골라야 하지 않나... 

 

결과 파싱하기

from Bio import SeqIO
from Bio.Blast import NCBIWWW
record = SeqIO.read("/home/koreanraichu/CaMV.fasta", format="fasta")
result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)
result_handle = open("/home/koreanraichu/CaMV_BLAST.xml")
from Bio import SeqIO
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
record = SeqIO.read("/home/koreanraichu/CaMV.fasta", format="fasta")
result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)
blast_record = NCBIXML.read(result_handle)
print(blast_record)

둘 다 해봤는데 전 아래꺼요... (위에꺼 오류남...)

 

save_file = open("my_blast.xml", "w") # 이거 뭐 하는 구문인데 경초 못 찾는다고 에러뜨냐
save_file.write(result_handle.read())
save_file.close()
result_handle.close()

저장할거면 얘도 추가해야된다. 아까 처음 실행할 때 뭔 괴상한 문자만 땡 하고 끝났던 걸 기억하는가? Biopython에서 BLAST 돌리면 결과를 알아서 저장해주는 게 아니라 BLAST 결과에 대한 핸들을 생성한다. 그리고 그 핸들을 읽는 게 다라서 밑에 올린 저장용 코드를 따로 또 쳐 줘야 읽은 핸들을 xml파일로 저장한다. 

 

from Bio import SeqIO
from Bio.Blast import NCBIWWW
record = SeqIO.read("/home/koreanraichu/CaMV.fasta", format="fasta")
result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)
save_file = open("my_blast.xml", "w") # 이거 뭐 하는 구문인데 경로 못 찾는다고 에러뜨냐
save_file.write(result_handle.read())
save_file.close()
result_handle.close()
# 위에 네 줄까지 추가해야 결과가 저장된다.
from Bio import SeqIO
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
for seq_record in SeqIO.parse("/home/koreanraichu/coronavirus.gb","genbank"): # 이거 이래 불러야되나
    print(seq_record.seq) # Genbank 가져올거면 이렇게 쓰세요
result_handle = NCBIWWW.qblast("blastn", "nt", seq_record.seq)
blast_record = NCBIXML.parse(result_handle)
# 결과가 하나면 read, 여러개면 parse
# 이것도 parse에 for문이 국룰인가?
save_file = open("coronavirus.xml", "w") # 이거 뭐 하는 구문인데 경로 못 찾는다고 에러뜨냐
save_file.write(result_handle.read())
save_file.close()
result_handle.close()
# 위에 네 줄까지 추가해야 결과가 저장된다

근데 Genbank는 레코드 뽑는 것까지 성공했는데 BLAST 결과가 없음으로 나온다. 웹으로 한번 돌려봐야되는데 문제는 .gb파일이 안열린다는 거… (게임보이 뭐시기가 필요해서 파이참으로 바이오파이썬 불러서 열어야 한다)

 

아무튼 이런 식으로 저장된다. (경로 지정을 안 했더니 파이참 폴더행)

 

BLAST 레코드

아까 저장한 XML파일! 와! 우리꺼! 결과! 열어보자! 

원래 저렇게 생긴 파일이다. 

 

from Bio import SeqIO
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
with open("my_blast.xml") as bf:
    blast_records = NCBIXML.parse(bf)
    E_VALUE_THRESH = 0.04 # 일단 이거 높으면 안좋은거다
    for blast_record in blast_records:
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                if hsp.expect < E_VALUE_THRESH:
                    print("****Alignment****")
                    print("sequence:", alignment.title)
                    print("length:", alignment.length)
                    print("e value:", hsp.expect)
                    print(hsp.query[0:75] + "...")
                    print(hsp.match[0:75] + "...")
                    print(hsp.sbjct[0:75] + "...")
# hsp 영역을 추출해 내는 스크립트.

여기서 특정 영역을 추출할 수 있는데, 이 코드에서는 hsp 영역을 추출했다. 웹으로 맨날 보는 그 부분... (결과는 코드블럭에서 길다고 짤렸다...)

 

****Alignment****
sequence: gi|1835923315|gb|MN508834.1| Binary vector pRATIO3212-SMXL7, complete sequence
length: 14584
e value: 3.74077e-174
TCTCTCCAAATGAAATGAACTTCCTTATATAGAGGAAGGGTCTTGCGAAGGATAGTGGGATTGTGCGTCATCCCT...
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||...
TCTCTCCAAATGAAATGAACTTCCTTATATAGAGGAAGGGTCTTGCGAAGGATAGTGGGATTGTGCGTCATCCCT...

이런 게 계~속 반복된다. 

응? 벡터요? 시범조교로 벡터 쓰심? 아니 벡터가 아니라 CaMV 시퀀스 썼는데 이게 35S promoter 시퀀스였던 모양이다. CaMV는 콜리플라워 모자이크 바이러스로, 볼티모어 클래스 7(DNA-RT)에 속한다. 엔빌롭이 있고… 근데 저 바이러스가 벡터랑 뭔 상관이냐고? 유전자 과발현 벡터에는 CaMV 바이러스의 35S promoter가 들어간다. 

 

from Bio import SeqIO
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML

record = SeqIO.read("/home/koreanraichu/coronavirus.gb", format="genbank")
result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)
blast_records = NCBIXML.parse(result_handle)

E_VALUE_THRESH = 0.04 # 일단 이거 높으면 안좋은거다
for blast_record in blast_records:
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if hsp.expect < E_VALUE_THRESH:
                print("****Alignment****")
                print("sequence:", alignment.title)
                print("length:", alignment.length)
                print("e value:", hsp.expect)
                print(hsp.query[0:75] + "...")
                print(hsp.match[0:75] + "...")
                print(hsp.sbjct[0:75] + "...")
# hsp 영역을 추출해 내는 스크립트.

아 저장 필요없고 결과나 보여주셈! 하면 다이렉트로 돌려서 결과 보는 것도 된다. 참고로 저기 있는 E-value는 일단 클수록 안 좋다. 

 

SearchIO

뭔 IO가 들어가요? 저거 사실 인풋/아웃풋 약어다. 

아까 그 으아악 내눈을 유발했던 XML파일... 거기서 우리가 원하는 정보를 볼 수 있게 해 주는 게 이 친구 역할이다. (대충 개비스콘 아저씨 짤)

 

from Bio import SearchIO
blast_qresult = SearchIO.read("my_blast.xml", "blast-xml")
print(blast_qresult)
Program: blastn (2.12.0+)
  Query: No (345)
         definition line
 Target: nt
   Hits: ----  -----  ----------------------------------------------------------
            #  # HSP  ID + description
         ----  -----  ----------------------------------------------------------
            0      1  gi|1835923315|gb|MN508834.1|  Binary vector pRATIO3212-...
            1      1  gi|1835923313|gb|MN508833.1|  Binary vector pRATIO3212-...
            2      1  gi|1835923311|gb|MN508832.1|  Binary vector pRATIO1212-...
            3      1  gi|1755168873|gb|MN055609.1|  Plant binary expression v...
            4      1  gi|1755168871|gb|MN055608.1|  Plant binary expression v...
            5      1  gi|1755168869|gb|MN055607.1|  Plant binary expression v...
            6      1  gi|1755168867|gb|MN055606.1|  Plant binary expression v...
            7      1  gi|1755168865|gb|MN055605.1|  Plant binary expression v...
            8      1  gi|1755168863|gb|MN055604.1|  Plant binary expression v...
            9      1  gi|1732472841|gb|MK896906.1|  Plant binary expression v...
           10      1  gi|1732472839|gb|MK896905.1|  Plant binary expression v...
           11      1  gi|1732472836|gb|MK896904.1|  Plant binary expression v...
           12      1  gi|1732472834|gb|MK896903.1|  Plant binary expression v...
           13      3  gi|1732472826|gb|MK896900.1|  Plant binary expression v...
           14      3  gi|1732472823|gb|MK896899.1|  Plant binary expression v...
           15      3  gi|1732472819|gb|MK896898.1|  Plant binary expression v...
           16      3  gi|1732472816|gb|MK896897.1|  Plant binary expression v...
           17      3  gi|1732472812|gb|MK896896.1|  Plant binary expression v...
           18      1  gi|1693859709|gb|MK204379.1|  Binary vector pBI-Xeg, co...
           19      1  gi|1624318504|gb|MK501803.1|  Expression vector pHytru,...
           20      1  gi|1050047859|dbj|AB830573.1|  Gateway vector pB5nYGW D...
           21      1  gi|1050047854|dbj|AB830572.1|  Gateway vector pB5nRGW D...
           22      1  gi|1050047849|dbj|AB830571.1|  Gateway vector pB5nGGW D...
           23      1  gi|1050047842|dbj|AB830570.1|  Gateway vector pB5nCGW D...
           24      1  gi|1050047835|dbj|AB830569.1|  Gateway vector pB5GWnY D...
           25      1  gi|1050047829|dbj|AB830568.1|  Gateway vector pB5GWnR D...
           26      1  gi|1050047822|dbj|AB830567.1|  Gateway vector pB5GWnG D...
           27      1  gi|1050047815|dbj|AB830566.1|  Gateway vector pB5GWnC D...
           28      1  gi|1050047807|dbj|AB830565.1|  Gateway vector pB5GWcY D...
           29      1  gi|1050047801|dbj|AB830564.1|  Gateway vector pB5GWcR D...
           ~~~
           47      3  gi|2087986960|gb|MW026675.1|  Plant expression vector p...
           48      3  gi|2033742660|gb|MW557527.1|  Cloning vector pGWB-cLUC,...
           49      3  gi|2033742656|gb|MW557526.1|  Cloning vector pGWB-nLUC,...

매우 편안한 것을 알 수 있다. 저거 판다스 데이터프레임으로 묶어서 그룹바이 마렵네 

 

SearchIO iteration

for hit in blast_qresult:
    print(hit)
Query: No
       definition line
  Hit: gi|1835923315|gb|MN508834.1| (14584)
       Binary vector pRATIO3212-SMXL7, complete sequence
 HSPs: ----  --------  ---------  ------  ---------------  ---------------------
          #   E-value  Bit score    Span      Query range              Hit range
       ----  --------  ---------  ------  ---------------  ---------------------
          0  3.7e-174     623.45     345          [0:345]             [791:1136]

저런 게 반복된다. 뭐... 리스트나 딕셔너리마냥 iteration, 인덱싱에 슬라이싱도 되고, 특정 영역(id, sequence length)만 볼 수도 있다. 물론 슬라이싱과 동시적용도 된다. 

 

sort_key = lambda hit: hit.seq_len
sorted_qresult = blast_qresult.sort(key=sort_key, reverse=True, in_place=False)
for hit in sorted_qresult:
    print("%s %i" % (hit.id, hit.seq_len))
gi|1835923315|gb|MN508834.1| 14584
gi|1835923313|gb|MN508833.1| 14548
gi|1693859709|gb|MK204379.1| 13667
gi|2087986978|gb|MW026679.1| 13558
gi|2087986968|gb|MW026677.1| 13439
gi|2033742656|gb|MW557526.1| 13430
gi|2087986973|gb|MW026678.1| 13025
gi|1732472816|gb|MK896897.1| 13004
gi|2087986983|gb|MW026680.1| 12906
gi|1835923311|gb|MN508832.1| 12814
gi|1732472819|gb|MK896898.1| 12754
gi|2033742660|gb|MW557527.1| 12656
gi|1050047859|dbj|AB830573.1| 12546
gi|1050047849|dbj|AB830571.1| 12546
gi|1050047842|dbj|AB830570.1| 12546
gi|1050047835|dbj|AB830569.1| 12531
gi|1050047822|dbj|AB830567.1| 12531
gi|1050047815|dbj|AB830566.1| 12531
gi|1050047854|dbj|AB830572.1| 12519
gi|1050047829|dbj|AB830568.1| 12504
gi|2087986964|gb|MW026676.1| 12332
gi|1050047782|dbj|AB830561.1| 12266
gi|1050047801|dbj|AB830564.1| 12249
gi|1050047720|dbj|AB830552.1| 12228
gi|1050047788|dbj|AB830562.1| 12222
gi|1050047776|dbj|AB830560.1| 12222
gi|1050047770|dbj|AB830559.1| 12213
gi|1050047755|dbj|AB830557.1| 12213
gi|1050047748|dbj|AB830556.1| 12213
gi|1050047807|dbj|AB830565.1| 12207
gi|1050047794|dbj|AB830563.1| 12207
gi|1050047762|dbj|AB830558.1| 12186
gi|1050047734|dbj|AB830554.1| 11931
gi|1050047741|dbj|AB830555.1| 11889
gi|1050047727|dbj|AB830553.1| 11889
gi|2087986960|gb|MW026675.1| 11799
gi|1732472826|gb|MK896900.1| 11186
gi|1732472823|gb|MK896899.1| 11185
gi|1732472812|gb|MK896896.1| 10473
gi|1732472836|gb|MK896904.1| 6122
gi|1755168865|gb|MN055605.1| 5507
gi|1732472841|gb|MK896906.1| 5507
gi|1755168863|gb|MN055604.1| 5506
gi|1732472839|gb|MK896905.1| 5506
gi|1755168869|gb|MN055607.1| 5498
gi|1755168867|gb|MN055606.1| 5497
gi|1755168873|gb|MN055609.1| 5465
gi|1755168871|gb|MN055608.1| 5464
gi|1624318504|gb|MK501803.1| 4844
gi|1732472834|gb|MK896903.1| 4794

판다스마냥 특정 요소를 기준으로 정렬할 수도 있다. (시퀀스 길이 기준으로 정렬한 것)

 

filter_func = lambda hit: len(hit.hsps) > 1
filtered_qresult = blast_qresult.hit_filter(filter_func)
print(filtered_qresult)
Query: No (345)
         definition line
 Target: nt
   Hits: ----  -----  ----------------------------------------------------------
            #  # HSP  ID + description
         ----  -----  ----------------------------------------------------------
            0      3  gi|1732472826|gb|MK896900.1|  Plant binary expression v...
            1      3  gi|1732472823|gb|MK896899.1|  Plant binary expression v...
            2      3  gi|1732472819|gb|MK896898.1|  Plant binary expression v...
            3      3  gi|1732472816|gb|MK896897.1|  Plant binary expression v...
            4      3  gi|1732472812|gb|MK896896.1|  Plant binary expression v...
            5      3  gi|2087986983|gb|MW026680.1|  Plant expression vector p...
            6      3  gi|2087986978|gb|MW026679.1|  Plant expression vector p...
            7      3  gi|2087986973|gb|MW026678.1|  Plant expression vector p...
            8      3  gi|2087986968|gb|MW026677.1|  Plant expression vector p...
            9      3  gi|2087986964|gb|MW026676.1|  Plant expression vector p...
           10      3  gi|2087986960|gb|MW026675.1|  Plant expression vector p...
           11      3  gi|2033742660|gb|MW557527.1|  Cloning vector pGWB-cLUC,...
           12      3  gi|2033742656|gb|MW557526.1|  Cloning vector pGWB-nLUC,...

Process finished with exit code 0

함수 걸어서 필터링도 된다는데 람다가 국룰인건가... 

 

Hit, HSP, Fragment

무슨 족보 그리는것도 아니고 하위의 하위 뭔데... 

 

Hit

아까 hit.id 이런 식으로 잠깐 지나쳤던 것. 어쨌거나 상위 개념인 듯 하다. 

 

from Bio import SearchIO
blast_qresult = SearchIO.read("my_blast.xml", "blast-xml")
blast_hit = blast_qresult[0]
print(blast_hit)
Query: No
       definition line
  Hit: gi|1835923315|gb|MN508834.1| (14584)
       Binary vector pRATIO3212-SMXL7, complete sequence
 HSPs: ----  --------  ---------  ------  ---------------  ---------------------
          #   E-value  Bit score    Span      Query range              Hit range
       ----  --------  ---------  ------  ---------------  ---------------------
          0  3.7e-174     623.45     345          [0:345]             [791:1136]

색인할 수 있다고 한다. 참고로 저 [0]을 잘 기억해두자. 

 

HSP

힛샥 프로테인...아니고 High scoring pair의 준말이다. 

 

from Bio import SearchIO
blast_qresult = SearchIO.read("my_blast.xml", "blast-xml")
blast_hsp = blast_qresult[0][0]
print(blast_hsp)
Query: No definition line
        Hit: gi|1835923315|gb|MN508834.1| Binary vector pRATIO3212-SMXL7, com...
Query range: [0:345] (1)
  Hit range: [791:1136] (-1)
Quick stats: evalue 3.7e-174; bitscore 623.45
  Fragments: 1 (345 columns)
     Query - TCTCTCCAAATGAAATGAACTTCCTTATATAGAGGAAGGGTCTTGCGAAGGATAGTGGG~~~TCAAT
             |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||~~~|||||
       Hit - TCTCTCCAAATGAAATGAACTTCCTTATATAGAGGAAGGGTCTTGCGAAGGATAGTGGG~~~TCAAT

Process finished with exit code 0

얘는 그래서 힛이랑 달리 색인할 때 리스트 두 개가 들어간다. 각각 맨 첫번째 힛에서 맨 첫번째 HSP를 뜻한다. 색인할 때 하나 빼먹으면 힛이 출력된다. 

 

# 얘네는 hsp 위치까지 지정해줘야 출력한다
print(blast_hsp.query_range) # 첫번째 hit의 첫번째 hsp의 query range
print(blast_hsp.evalue) # 첫번째 hit의 첫번째 hsp의 evalue
print(blast_hsp.hit_start) # 첫번째 hit의 첫번째 hsp의 hit start coordinates
print(blast_hsp.query_span) # 첫번째 hit의 첫번째 hsp의 how many residues in the query sequence(residue 몇개냐)
print(blast_hsp.aln_span) # 첫번째 hit의 첫번째 hsp의 alignment 길이
print(blast_hsp.gap_num) # 첫번째 hit의 첫번째 hsp의 gap no.
print(blast_hsp.ident_num) # 첫번째 hit의 첫번째 hsp의 identical residue no.
print(blast_hsp.aln) # 배열 뽑아줘!
(0, 345)
3.74077e-174
791
345
345
0
345
# 밑에는 배열이다 
Alignment with 2 rows and 345 columns
TCTCTCCAAATGAAATGAACTTCCTTATATAGAGGAAGGGTCTT...AAT No
TCTCTCCAAATGAAATGAACTTCCTTATATAGAGGAAGGGTCTT...AAT gi|1835923315|gb|MN508834.1|

참고로 힛 상태에서 저거 쓰면 오류난다. 

 

HSPFragment

가장 하위 개념. 얘는 색인할 때 []가 세 개 들어간다. 

 

blast_frag = blast_qresult[0][0][0] # 첫번째 hit의 첫번째 hsp의 첫번째 fragment
print(blast_frag)
Query: No definition line
        Hit: gi|1835923315|gb|MN508834.1| Binary vector pRATIO3212-SMXL7, com...
Query range: [0:345] (1)
  Hit range: [791:1136] (-1)
  Fragments: 1 (345 columns)
     Query - TCTCTCCAAATGAAATGAACTTCCTTATATAGAGGAAGGGTCTTGCGAAGGATAGTGGG~~~TCAAT
             |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||~~~|||||
       Hit - TCTCTCCAAATGAAATGAACTTCCTTATATAGAGGAAGGGTCTTGCGAAGGATAGTGGG~~~TCAAT

Process finished with exit code 0

...HSP랑 차이 없는데? 

 

blast_frag = blast_qresult[0][0][0] # 첫번째 hit의 첫번째 hsp의 첫번째 fragment
print(blast_frag.hit) # hit sequence(sequence object)
print(blast_frag.query_start) # 시작 좌표
print(blast_frag.hit_strand) # hi sequence strand 수
ID: gi|1835923315|gb|MN508834.1|
Name: aligned hit sequence
Description: Binary vector pRATIO3212-SMXL7, complete sequence
Number of features: 0
/molecule_type=DNA
Seq('TCTCTCCAAATGAAATGAACTTCCTTATATAGAGGAAGGGTCTTGCGAAGGATA...AAT')
0
-1

아니 strand 수에 -1은 뭐여... 

Lv. 35 라이츄

Lv. 35 라이츄

광고 매크로 없는 청정한 블로그를 위해 노력중입니다. 근데 나만 노력하는 것 같음… ㅡㅡ

번외편-코딩테스트 풀이

Coding/Python 2022. 8. 20. 23:36

코테 본다는 얘기는 못 들은 것 같고... 그냥 면접보다가 어디 실력 좀 볼까? 해서 나온거긴 한데... 
IDE도 셋업 안 된 상태에서 jupyter 데모로 봤었음... 
근데 Python을 안 쓰면 나한테 연락을 안 할것인데 이건 뭔 상황인지 모르겠고... 

거기다가 실습할 때 썼던 파일들 올릴랬는데 깃헙 뻑나서 오전중에는 그거랑 씨름함... 해결은 봤죠. 
참고로 어제 사용한 FASTAQ파일과 오늘 사용한 FASTAQ파일은 다릅니다. 이거 구하기도 개빡셈. 


모듈 모셔오기

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
import numpy as np
import pandas as pd

세번째줄 안불러도 됨 그거 쩌리예요
판다스 넘파이는 4, 5번 풀려고 부른거 

 

1. FastaQ 파일을 Fasta 파일로 변환해라

# FASTAQ file 불러오기
handle="/home/koreanraichu/sra_data_mo.fastq"
for record in SeqIO.parse(handle,"fastq"):
    print(record.id)
# FASTA파일로 쓰기
convert_fasta=SeqIO.convert(handle, "fastq", "/home/koreanraichu/sra_data_mo.fasta","fasta")

어제는 내 컴퓨터도 아니고 Jupyter 데모여서 파일 자체 확인을 못 했는데, 이런 상황에서 잘 됐나 확인할거면 SeqIO read로 불러와보자. SeqIO는 뭉텅이로 있으면 parse만 된다. (정확히는 parse+for)

 

(변환 후)

 

2. Fasta파일에서 A, T, G, C 세기

handle2="/home/koreanraichu/sra_data_mo.fasta"
for record2 in SeqIO.parse(handle2,"fasta"):
    print(Seq(record2.seq).count("A"))
    print(type(Seq(record2.seq).count("A")))
    # Count로 세는 건 되는데 합계를 안 내준다. int형인데 더해주질 않음.

정수형인데 왜 더하질 못하고 각개출력하니... 
내가 저 상태에서 리스트로 만들려고 append 했더니 아 실패요. 그것도 각개 리스트로 나오고 A=A.append 줬더니 None됨... 

니네 왜그래... 

+그거 해결봤음. 진짜 웃긴 과정이었음... 

handle2="/home/koreanraichu/sra_data_mo.fasta"
A = []
for record2 in SeqIO.parse(handle2,"fasta"):
    A.append(Seq(record2.seq).count("A"))
print(A)

애초에 만들었던 빈 리스트를 밖으로 빼니까 되는데요? 와 신박하네. 

handle2="/home/koreanraichu/sra_data_mo.fasta"
A = [] #아니 이게 이렇게 해결된다고?
T = []
G = []
C = []
for record2 in SeqIO.parse(handle2,"fasta"):
    A.append(Seq(record2.seq).count("A"))
    T.append(Seq(record2.seq).count("T"))
    G.append(Seq(record2.seq).count("G"))
    C.append(Seq(record2.seq).count("C"))
print("A: ",sum(A),"T: ",sum(T),"G: ",sum(G),"C: ",sum(C))

코드_최종.py

 

아니 진짜 저걸로 해결되는거면 판다스 표 개판 오분전 된것도 저걸로 걍 정리해버리자. 

 

3. 전체 Sequence 세기

print(sum(len(r) for r in SeqIO.parse(handle2, "fasta")))

와 이걸 이렇게 해결보네

 

4. Top 10 length

for record2 in SeqIO.parse(handle2,"fasta"):
    record_id=np.array(record2.id)
    record_len=np.array(len(record2.seq))
    record_table=pd.DataFrame({"ID":record_id, "length":record_len},index=[0])
    print(record_table)
    # Dataframe이 이상하게 생성된다.
    """
                   ID  length
    0  SRR000021.37.2     249
    이런 식으로 항목 하나당 한 데이터프레임이 생성되는데 Array까지는 정상적으로 생성됨. 
    """

판다스로 표 만들어서 꼽아보려고 했더니 표가 개판 오분전이 되었다... 이쯤되면 array 차원부터 한번 봐야겠는데? array 차원부터 망했는데 이거? 

# A4. Top 10 length
record_id=[] # 야 이걸 이렇게 해결보네
record_len=[]
for record2 in SeqIO.parse(handle2,"fasta"):
    record_id.append(record2.id)
    record_len.append(len(record2.seq))
record_id_array=np.array(record_id)
record_len_array=np.array(record_len)

비효율적으로 보일 지 몰라도 이정도면 본인 수준에서는 장족의 발전임다... 어레이 제대로 된 것도 확인했음. 

record_table=pd.DataFrame({"ID":record_id_array,"Length":record_len_array})
print(record_table)

아 표 ㄹㅇ 이쁘게 나왔음. 

 

                 ID  Length
0     SRR000021.1.2     176
1     SRR000021.1.4      49
2     SRR000021.2.2      99
3     SRR000021.2.4     106
4     SRR000021.3.2      64
..              ...     ...
102  SRR000021.52.2     269
103  SRR000021.52.4       0
104  SRR000021.53.2     276
105  SRR000021.54.2      65
106  SRR000021.54.4     184

캬... 이 표 이쁜거 보게. 근데 여기서 끝이 아니고 length를 또 봐야 한다... 살려주세요. 일단 전에 배웠던대로 오름차순 정렬을 해 보자. 

record_table2=record_table.sort_values('Length',ascending=0)
                 ID  Length
68   SRR000021.35.2     280
104  SRR000021.53.2     276
102  SRR000021.52.2     269
64   SRR000021.33.2     269
70   SRR000021.36.2     268
..              ...     ...
63   SRR000021.32.4       0
73   SRR000021.37.4       0
65   SRR000021.33.4       0
71   SRR000021.36.4       0
53   SRR000021.27.4       0

엥? 중복값이 있구나 이거. 

 

record_table2=record_table.groupby('Length').count()
record_table2=record_table2.sort_values('Length',ascending=0)
print(record_table2.head(10))
        ID
Length    
280      1
276      1
269      2
268      1
259      1
257      1
249      1
247      1
244      1
240      1

ID가 왜 저렇게 나오냐면 저거 그룹바이로 같은 값 묶어서 세달라고 카운트 걸어서 그렇다. 즉, 269에 2 있는 건 269bp가 두개라는 얘기. 그럼 아이디는 어떻게 뽑나요 쿼리 걸어 쿼리 

근데 생각해보니까 이거 빈도수 Top10이지 길이가 아니다. 

 

record_table2=record_table2.sort_values('ID',ascending=0)
        ID
Length    
0       15
107      3
63       3
99       3
44       2
105      2
111      2
178      2
64       2
124      2

이거 뭐 하는 데이터인데 0개가 이래 많음? 

 

5. 조건부 합계(여기서는 100bp 기준으로 나눴다)

for record2 in SeqIO.parse(handle2,"fasta"):
    if len(record2.seq) > 100:
        i=0
        i=i+len(record2.seq)
    else:
        j=0
        j=j+len(record2.seq)
print(i,j)

얘도 각개로 나온다. 일단 2번부터 해결해야 할 듯. 참고로 4번 코드를 보시면 아시겠지만, 넘파이 판다스 둘 다 불렀다. 

굳이 if 하지 말고 저거 해결 봤으니까 걍 위에 표 불러서 판다스 소환하자. 

 

print(record_table[record_table.Length >= 100].sum())
print(record_table[record_table.Length < 100].sum())
ID        SRR000021.1.2SRR000021.2.4SRR000021.3.4SRR0000...
Length                                                 9984
dtype: object
ID        SRR000021.1.4SRR000021.2.2SRR000021.3.2SRR0000...
Length                                                 1955
dtype: object

이거 셀렉터 못줌? 

print("length >= 100bp:",record_table['Length'][record_table.Length >= 100].sum())
print("length < 100bp:",record_table['Length'][record_table.Length < 100].sum())
length >= 100bp: 9984
length < 100bp: 1955

되는데요? 

Lv. 35 라이츄

Lv. 35 라이츄

광고 매크로 없는 청정한 블로그를 위해 노력중입니다. 근데 나만 노력하는 것 같음… ㅡㅡ

Biopython으로 MSA 해보기

Coding/Python 2022. 8. 20. 03:05

MSA: multiple sequence alignment
여기에 관한 이론적인 설명은 나중에 또 입 털어드림 ㅇㅇ 

아 참고로 MSA 관련해서 다른건 다 결과가 제대로 나왔는데 툴 관련해서 결과가 안나왔어요 
이게 암만 찾아도 답이 없어서 좀 더 알아보고 수정할 가능성이 있음 


시범조교 앞으로

오늘은 시범조교 가짓수가 좀 많은데 그 중에서도 FASTA 파일들을 좀 올리고자 한다. 이거 말고도 pfam에서 두갠가 받았는데 그건 걍 가서 암거나 받으면 된다. 

 

agrobacterium.fasta
0.02MB
enterobacter.fasta
0.02MB
lactobacillus.fasta
0.02MB

해당 파일은 박테리아의 16s rRNA 시퀀스가 들어있는 파일이다. FASTA 파일이라 메모장 있으면 일단 열 수는 있다. 리눅스에서는 gedit으로 만들고 편집하고 다 했다. (vim 안씀) rRNA 시퀀스는 Silva에서 가져왔다. 고마워요 실바! 

 

Agrobacterium

A. radiobacter를 필두로 하는 뿌리혹세균들(밑에 두놈도 뿌리에 혹 만드는지는 모름)

 

-Agrobacterium
Agrobacterium radiobacter (구 Agrobacterium tumefaciens)
Agrobacterium agile
Agrobacterium pusense
Agrobacterium salinitolerans

-Rhizobium
Rhizobium tropici (일반적으로 알고 있는 뿌리혹박테리아)
Rhizobium hainanense
Rhizobium gallicum
Rhizobium fabae

-Hoeflea
Hoeflea alexandrii
Hoeflea halophila 웬지 짠 거 좋아하게 생겼는데? 할로박테리움같은건가 
Hoeflea trophica

-Ciceribacter
Ciceribacter lividus
Ciceribacter azotifigens
Ciceribacter thiooxidans

 

Enterobacter

E.coli를 필두로 하는 장내 세균들

 

-Enterococcus
Enterococcus faecalis (어디서 많이 봤음)
Enterococcus hirae
Enterococcus avium
Enterococcus caccae

-Escherichia
Escherichia coli (그냥 어디서나 볼 수 있는 대장균)
Escherichia coli O157:H7 (감염되면 X되는 대장균)
Escherichia albertii
Escherichia fergusonii

-Shigella
Shigella boydii
Shigella sonnei
Shigella dysenteriae
Shigella flexneri

 

Lactobacillus

님들 많이 드시는 그 유산균 맞습니다. 

 

-Lactobacillus
Lactobacillus acidophilus
Lactobacillus helveticus
Lactobacillus plantarum (이분도 김치에서 발견된다)
Lactobacillus thailandensis DSM 22698 = JCM 13996 (아마 뒤에껀 strain 이름인 듯)

-Leuconostoc
Leuconostoc carnosum
Leuconostoc mesenteroides
Leuconostoc kimchii (이건 있다)
Leuconostoc miyukkimchii (저자 나와봐요 미역김치는 무슨 저세상 학명이야)

-Bifidobacterium (얘네는 방선균인데 일단 유산균으로 섭취 하기는 함... 비피더스 뭐 이런거)
Bifidobacterium bifidum
Bifidobacterium actinocoloniiforme DSM 22766 (아마 strain 이름...)
Bifidobacterium catenulatum

 

아니 근데 미역김치 학명 실화냐 대체 어디서 발견하면 학명이 저렇게 되는건데요 아니 저기 김치 하나 있는거 뭔데 김치요 참고로 C.kimchii(구 L.kimchii)는 데이터 없어서 못 넣었음... 김치 파티 각 나왔다

 


시퀀스 가져오기

MSA를 할래도 시퀀스를 가져와야 한다. 그것도 하나 말고 뭉텅이로. 여기서도 read()와 parse()로 나뉘긴 한데, 이거는 기존에 SeqIO로 읽을 때처럼 시퀀스 갯수가 여러 개 있어도 single alignment면 read()로 읽을 수 있다. 

참고로 쿡북 예제로 스톡홀름 파일이 나오긴 했는데 FASTA도 불러올수는 있다. 

 

from Bio import AlignIO
alignment = AlignIO.read("/home/koreanraichu/agrobacterium.fasta", "fasta")
print(alignment)

MSA를 할 때는 SeqIO가 아니라 AlignIO로 불러오면 된다. 

Alignment with 14 rows and 1561 columns
AUUCUCAACUUGAGAGUUUGAUCCUGGCUCAGAACGAACGCUGG...AAG AB102732.157.1651
................................AACGAACGCUGG...... AB680818.1.1406
................................AUUGAACGCUGG...... AB680959.1.1462
................................AACGAACGCUGG...... AB682466.1.1406
................................AACGAACGCUGG...... AB682468.1.1406
......................................CGCUGG...... AB969785.1.1417
............AGAGUUUGAUCCUGGCUCAGAACGAACGCUGG...... AJ786600.1.1449
.................................................. DQ835303.1.1353
.................................................. GU564401.1.1419
................................AACGAACGCUGG...... JF957616.1.1410
.................................................. JQ230000.1.1392
.................................................. KP142169.1.1341
............AGAGUUUGAUCAUGGCUCAGAACGAACGCUGG...... KU975391.1.1451
...............................A-ACGAACGCUGG...... KX510117.1.1407

여기서 말하는 컬럼은 글자 하나. PDB파일처럼 얘도 알파벳 하나가 컬럼이다. 

for record in alignment:
    print(record.id, record.description) # 예제 코드는 Sequence와 ID를 출력하라고 했는데 기니까 다른걸로 바꿔보자.
AB102732.157.1651 AB102732.157.1651 Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Rhizobiaceae;Allorhizobium-Neorhizobium-Pararhizobium-Rhizobium;Agrobacterium radiobacter
AB680818.1.1406 AB680818.1.1406 Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Rhizobiaceae;Allorhizobium-Neorhizobium-Pararhizobium-Rhizobium;Rhizobium tropici
AB680959.1.1462 AB680959.1.1462 Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Agrobacterium agile
AB682466.1.1406 AB682466.1.1406 Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Rhizobiaceae;Allorhizobium-Neorhizobium-Pararhizobium-Rhizobium;Rhizobium hainanense
AB682468.1.1406 AB682468.1.1406 Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Rhizobiaceae;Allorhizobium-Neorhizobium-Pararhizobium-Rhizobium;Rhizobium gallicum
AB969785.1.1417 AB969785.1.1417 Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Rhizobiaceae;Allorhizobium-Neorhizobium-Pararhizobium-Rhizobium;Agrobacterium pusense
AJ786600.1.1449 AJ786600.1.1449 Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Rhizobiaceae;Hoeflea;Hoeflea alexandrii
DQ835303.1.1353 DQ835303.1.1353 Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Rhizobiaceae;Allorhizobium-Neorhizobium-Pararhizobium-Rhizobium;Rhizobium fabae
GU564401.1.1419 GU564401.1.1419 Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Rhizobiaceae;Hoeflea;Hoeflea halophila
JF957616.1.1410 JF957616.1.1410 Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Rhizobiaceae;Hoeflea;Hoeflea phototrophica
JQ230000.1.1392 JQ230000.1.1392 Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Rhizobiaceae;Ciceribacter;Ciceribacter lividus
KP142169.1.1341 KP142169.1.1341 Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Rhizobiaceae;Allorhizobium-Neorhizobium-Pararhizobium-Rhizobium;Agrobacterium salinitolerans
KU975391.1.1451 KU975391.1.1451 Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Rhizobiaceae;Ciceribacter;Ciceribacter thiooxidans
KX510117.1.1407 KX510117.1.1407 Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Rhizobiaceae;Allorhizobium-Neorhizobium-Pararhizobium-Rhizobium;Ciceribacter azotifigens

이런 식으로 레코드 만들어도 된다. 예제에는 ID랑 시퀀스로 되어 있던거 ID랑 description으로 바꿨다... 시퀀스가 너무 길어... 출력 format을 지정하거나 쌩으로 레코드를 가져올 수도 있는데, 길이가... 길이가...... 정말 장난없다... 

 

from Bio import AlignIO
alignments = AlignIO.parse("/home/koreanraichu/lactobacillus.fasta", "fasta")
for alignment in alignments:
    print(alignment)
Alignment with 12 rows and 1606 columns
...............UUUGAUCAUGGCUCAGGACGAACGCUGGC...... AB008203.1.1553
................UUGAUCAUGGCUCAGGACGAACGCUGGC...... AB008210.1.1552
.........................................CGC...... AB022925.1.1450
..........................................GC...... AB023242.1.1446
............GAGUUUGAUCCUGGCUCAGGACGAACGCUGGC...... AB112083.1.1557
...........GGGUUUCGAUUCUGGCUCAGGAUGAACGCUGGC...... AB437354.1.1520
.............GUUUCGAUUCUGGCUCAGGAUGAACGCUGGC...... AB437356.1.1519
...............................GAUGAACGCUGGC...... AF173986.1.1505
...........AGAGUUUGAUCCUGGCUCAGGAUGAACGCUGGC...... AYZK01000017.137.1688
UUUUUUUGUGGAGGGUUUGAUUCUGGCUCAGGAUGAACGCUGGC...AGA CP011786.1420993.1422533
UUUUUUUGUGGAGGGUUUGAUUCUGGCUCAGGAUGAACGCUGGC...AGA CP011786.1514900.1516440
...............................GAUGAACGCUGGC...... KX232108.1.1480

Parse로 가져와도 똑같긴 한데 SeqIO나 얘나 파싱으로 가져오면 for문 있어야 제대로 뜬다. 

 

from Bio import AlignIO
handle="/home/koreanraichu/lactobacillus.fasta"
for alignments in AlignIO.parse(handle, "fasta",seq_count=2):
    print("Alignment lentgh %i" % alignments.get_alignment_length())
    for record in alignments:
        print(record.description)
Alignment lentgh 1606
AB008203.1.1553 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;Lactobacillus acidophilus
AB008210.1.1552 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;Lactobacillus helveticus
Alignment lentgh 1606
AB022925.1.1450 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Leuconostoc;Leuconostoc carnosum
AB023242.1.1446 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Leuconostoc;Leuconostoc mesenteroides
Alignment lentgh 1606
AB112083.1.1557 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactiplantibacillus;Lactobacillus plantarum
AB437354.1.1520 Bacteria;Actinobacteriota;Actinobacteria;Bifidobacteriales;Bifidobacteriaceae;Bifidobacterium;Bifidobacterium adolescentis
Alignment lentgh 1606
AB437356.1.1519 Bacteria;Actinobacteriota;Actinobacteria;Bifidobacteriales;Bifidobacteriaceae;Bifidobacterium;Bifidobacterium bifidum
AF173986.1.1505 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Leuconostoc;Leuconostoc kimchii
Alignment lentgh 1606
AYZK01000017.137.1688 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lacticaseibacillus;Lactobacillus thailandensis DSM 22698 = JCM 13996
CP011786.1420993.1422533 Bacteria;Actinobacteriota;Actinobacteria;Bifidobacteriales;Bifidobacteriaceae;Bifidobacterium;Bifidobacterium actinocoloniiforme DSM 22766
Alignment lentgh 1606
CP011786.1514900.1516440 Bacteria;Actinobacteriota;Actinobacteria;Bifidobacteriales;Bifidobacteriaceae;Bifidobacterium;Bifidobacterium actinocoloniiforme DSM 22766
KX232108.1.1480 Bacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Leuconostoc;Leuconostoc miyukkimchii

끊어뽑기도 된다. 코드에 있는 숫자 수정하면 3개 4개 묶는 것도 된다. 

 

쓰기

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align import MultipleSeqAlignment
from Bio import AlignIO
align1=MultipleSeqAlignment([
    SeqRecord(Seq("GGCC"),id="HaeIII"),
    SeqRecord(Seq("G-CC"),id="id2"),
    SeqRecord(Seq("G--C"),id="id3")
])
align2=MultipleSeqAlignment([
    SeqRecord(Seq("GAATTC"),id="EcoRi"),
    SeqRecord(Seq("G-ATTC"),id="id5"),
    SeqRecord(Seq("G--TTC"),id="id6")
])
my_alignment=[align1,align2]
# 여기까지 시퀀스를 만들고

AlignIO.write(my_alignment, "/home/koreanraichu/sequence.fasta", "fasta")
# 여기서 저(별)장

나 저거 IL-16 전에 썼던거 안된 이유 알아냈음... ㅋㅋㅋㅋㅋㅋ Seq=("")가 아니라 Seq("")였다. 아무튼 이런 식으로 쓸 수 있고 주석 있어서 대충 아시겠지만 순서대로 Alignment 요소를 만들고->시퀀스 만들고->파일로 저장하는 코드이다. 

이런 식으로 저장된다. (그야 내가 FASTA로 저장했으니까...)

참고로 미리 말해두겠지만 본인 리눅스에서 FASTA, PHYLIP, 스톡홀름, clustalW 다 지에딧으로 연다. 그렇게 열리고… 

이게 필립 파일인데 얘는 변환하려면 시퀀스 길이가 다 같아야 하고, 공백이 -이어야 한다. 공백에 . 있으면 지원 안 한다고 오류 토한다. 

 

선생님 이거 변환 됩니까? 

변환이요? 뭐 클러스탈이나 스톡뭐시기 말하는거면 된다. 

 

count = AlignIO.convert("/home/koreanraichu/agrobacterium.fasta", "fasta", "/home/koreanraichu/agrobacterium.sth",
                        "stockholm")
print("Converted %i alignments" % count)
alignments = AlignIO.parse("/home/koreanraichu/agrobacterium.fasta", "fasta")
count = AlignIO.write(alignments, "/home/koreanraichu/agrobacterium.aln","clustal")
print("Converted %i alignments" % count)
# ClustalW

위는 convert를 사용해 스톡홀름 파일로 바꾼거고, 아래는 parse와 write를 사용해 클러스탈로 바꾼 것이다. 

 

ClustalW
stockholm

위가 clustalw, 아래가 스톡홀름 포맷. 

 

alignment = AlignIO.read("/home/koreanraichu/PF00096_seed.txt", "stockholm")
AlignIO.write([alignment], "PF05371_seed.aln", "clustal")
print("Converted %i alignments" % count)
# read 후 리스트화해서 변환

read로 읽은 다음 리스트화해서 변환하는 것도 된다. 

count = AlignIO.convert("/home/koreanraichu/PF08449_seed.txt", "stockholm", "/home/koreanraichu/PF08449_seed.phy",
                        "phylip")
print("Converted %i alignments" % count)
# 이거라면 필립 될거같은데?

공백이 -이고 시퀀스 bp가 전부 동일하다는 전제하에 Phylip 포맷으로 만드는 것도 된다. 

 

alignment2 = AlignIO.read("/home/koreanraichu/PF08449_seed.txt", "stockholm")
name_mapping = {}
for i, record in enumerate(alignment):
    name_mapping[i] = record.id
    record.id = "seq%i" % i
AlignIO.write([alignment], "PF08449_seed_ID.phy", "phylip")
# 오 뭔진 모르겠지만 ID가 숫자가 된 건가

딕셔너리화 한 다음 아이디 새로 지정해주는 것도 된다. 위 코드를 실행하면 

이렇게 된다. 잘 보면 ID 영역에 seq(숫자)가 들어가 있다. 

 

출력 형식만 바꾸기

이건 뭔 소리냐... 여는 파일은 FASTA파일인데 파일 수정 없이 clustalw나 스톡홀름 형식으로 출력할 수 있다. (phylip은 아마 글자수 같고 공백 -여야 먹힐듯) 참고로 결과는 길어서 생략. 

 

from Bio import AlignIO
alignment = AlignIO.read("/home/koreanraichu/lactobacillus.fasta", "fasta")
print(format(alignment, "clustal"))

해당 코드는 FASTA파일을 클러스탈 형식으로 보여달라는 얘기고, 스톡홀름으로 볼 수도 있다. 

 

슬라이싱

넘파이 배열 자르는것처럼 얘도 슬라이싱이 된다. (넘파이 2차원 배열 잘라먹는 거 생각하면 된다)

 

from Bio import AlignIO
alignment = AlignIO.read("/home/koreanraichu/PF08449_seed.txt", "stockholm")
print(alignment[0:5])
ISLIPISMIMVGCCSNVISLELIMKQSQ-SH-A---------IL...TSG S35B4_DICDI/7-328
SFVLILSLVFGGCCSNVISFEHMVQGSN-INLG---------NI...YGS YEA4_KLULA/2-321
NSLKAFALVFGGCCSNVITFETLMSNET-GSIN---------NL...LGS YEA4_YEAST/3-327
MIASALSFIFGGCCSNAYALEALVREFP-SS-G---------IL...SAR YEA4_SCHPO/1-304
--GVMLSLIFGGCCSNVFALESIIKVEP-GA-G---------TL...L-- Q1K947_NEUCR/81-406

이렇게 하면 pandas의 head()처럼 위에서 다섯개만 보여준다. 

 

print(alignment[0,1]) # 0번째 인덱스의 두번째 컬럼
S

이런 식으로 2차원 인덱싱과 슬라이싱도 되고

print(alignment[:,6]) # 무슨 컬럼을 가져온거냐
SSASSGTCLSSASLASL

이거는 전체 인덱스에서 일곱번째 글자들을 가져온 것. (실화다) 

print(alignment[0:5,0:5]) # index+column 범위를 지정할 수 있다
ISLIP S35B4_DICDI/7-328
SFVLI YEA4_KLULA/2-321
NSLKA YEA4_YEAST/3-327
MIASA YEA4_SCHPO/1-304
--GVM Q1K947_NEUCR/81-406

이런 식으로 범위 지정해서 잘라오는 것도 되고 특정 인덱스, 컬럼부터 다 가져오는 것도 된다. 저거 저대로 잘라먹는 것도 되고, Numpy 데려와서 배열화하는 것도 가능하다. 근데 예제대로 했더니 오류남... (안되는 건 아닌데 오류뜬다) 

 

본격적인 정보 얻기

아까까지 대체 뭐 한겨... 

 

from Bio import AlignIO
alignment = AlignIO.read("/home/koreanraichu/lactobacillus.fasta", "fasta") # 아이고 유산균씨 오랜만입니다
substitutions = alignment.substitutions
print(substitutions)
       .       A       C       G       U
. 1365.0   321.0   310.5   461.5   406.0
A  321.0 19985.0   791.0  2218.5   978.5
C  310.5   791.0 18856.0   957.5  2048.0
G  461.5  2218.5   957.5 26642.0  1015.0
U  406.0   978.5  2048.0  1015.0 16007.0

이거는 rRNA라 티민 대신 우라실이 나왔다. 저건 아마도 해당 염기가 다른걸로 바뀐 비율? 횟수? 이런 걸 봐주는 것 같다. 예를 들자면 아데닌이 시토신, 구아닌, 우라실로 바뀌는 뭐 그런거. (공백 공백은 뭐여...) 저기서 옵션으로 .select() 주면 출력 순서가 바뀐다. (예: .select(“AUGC.”)으로 주면 행렬 인덱스랑 컬럼 순서가 AUCG.로 바뀐다)

 

Tool

이 부분이 문제가 많아요... OTL 둘 다 깔고 해야 하나... 

 

ClustalW

from Bio.Align.Applications import ClustalwCommandline
cline = ClustalwCommandline("clustalw2", infile="/home/koreanraichu/lactobacillus.fasta")
print(cline)

이런 식으로 불러서 

clustalw2 -infile=/home/koreanraichu/lactobacillus.fasta

니가 왜 거기서 나옴???

 

이거 뭐 트리 만드는 파일 만들어준다매 왜 안해줘요. 이럴거면 그냥 메가 쓰면 안되냐

 

MUSCLE

from Bio.Align.Applications import MuscleCommandline
cline = MuscleCommandline(input="/home/koreanraichu/lactobacillus.aln", out="/home/koreanraichu/lactobacillus.txt")
print(cline)

얘는 이런 식으로 부르면 

muscle -in /home/koreanraichu/lactobacillus.aln -out /home/koreanraichu/lactobacillus.txt

이러시는 이유가 있으실 것 아니예요. 

 

from Bio.Align.Applications import MuscleCommandline
cline = MuscleCommandline(input="/home/koreanraichu/lactobacillus.fasta", out="/home/koreanraichu/lactobacillus.txt",
clw=True)
print(cline)
muscle -in /home/koreanraichu/lactobacillus.fasta -out /home/koreanraichu/lactobacillus.txt -clw

이걸 쓰면 clustalW 비슷하게 출력된다. (물론 파일은 생기지 않았습니다)

from Bio.Align.Applications import MuscleCommandline
cline = MuscleCommandline(input="/home/koreanraichu/lactobacillus.fasta", out="/home/koreanraichu/lactobacillus.txt",
clwstrict=True)
print(cline)
muscle -in /home/koreanraichu/lactobacillus.fasta -out /home/koreanraichu/lactobacillus.txt -clwstrict

이걸 쓰면 clustalW headline을 쓴다는데 파일이 없다. 착한 사람만 보이는 파일인가... 그래서 얘네는 못했다. 

 

Pairwise

왜 블래스트에서 뭐 검색하면 결과에 줄 그어서 나오는 그거 말하는 거 맞다. 

 

from Bio import pairwise2
from Bio import SeqIO
seq1 = SeqIO.read("/home/koreanraichu/alpha.faa", "fasta")
seq2 = SeqIO.read("/home/koreanraichu/beta.faa", "fasta")
alignments = pairwise2.align.globalxx(seq1.seq, seq2.seq)
print(alignments)

근데 이거 파싱으로 개별 레코드끼리 비교 못하나... 나중에 다시 함 트라이해 볼 예정. 

 

[Alignment(seqA='MV-LSPADKTNV---K-A--A-WGKVGAHAG---EY-GA-EALE-RMFLSF----PTTK-TY--FPHFDL-SH-G--S---AQVK-G------HGKKV--A--DA-LTNAVAHVD-DMPNALS----A-LSD-LHAH--KLR-VDPV-NFKL-LSHCL---LVT--LAAHLPA----EFTPA-VH-ASLDKFLAS---VSTV------LTS--KYR-', seqB='MVHL-----T--PEEKSAVTALWGKV-----NVDE-VG-GEAL-GR--L--LVVYP---WT-QRF--F--ES-FGDLSTPDA-V-MGNPKVKAHGKKVLGAFSD-GL----AH--LD--N-L-KGTFATLS-EL--HCDKL-HVDP-ENF--RL---LGNVLV-CVL-AH---HFGKEFTP-PV-QA------A-YQKV--VAGVANAL--AHKY-H', score=72.0, start=0, end=217), Alignment(seqA='MV-LSPADKTNV--K-A--A-WGKVGAHAG---EY-GA-EALE-RMFLSF----PTTK-TY--FPHFDL-SH-G--S---AQVK-G------HGKKV--A--DA-LTNAVAHVD-DMPNALS----A-LSD-LHAH--KLR-VDPV-NFKL-LSHCL---LVT--LAAHLPA----EFTPA-VH-ASLDKFLAS---VSTV------LTS--KYR-', seqB='MVHL-----T-PEEKSAVTALWGKV-----NVDE-VG-GEAL-GR--L--LVVYP---WT-QRF--F--ES-FGDLSTPDA-V-MGNPKVKAHGKKVLGAFSD-GL----AH--LD--N-L-KGTFATLS-EL--HCDKL-HVDP-ENF--RL---LGNVLV-CVL-AH---HFGKEFTP-PV-QA------A-YQKV--VAGVANAL--AHKY-H', score=72.0, start=0, end=216), Alignment(seqA='MV-LSPADKTNV--K-A--A-WGKVGAHAG---EY-GA-EALE-RMFLSF----PTTK-TY--FPHFDL-SH-G--S---AQVK-G------HGKKV--A--DA-LTNAVAHVD-DMPNALS----A-LSD-LHAH--KLR-VDPV-NFKL-LSHCL---LVT--LAAHLPA----EFTPA-VH-ASLDKFLAS---VSTV------LTS--KYR-', seqB='MVHL-----TP-EEKSAVTALWGKV-----NVDE-VG-GEAL-GR--L--LVVYP---WT-QRF--F--ES-FGDLSTPDA-V-MGNPKVKAHGKKVLGAFSD-GL----AH--LD--N-L-KGTFATLS-EL--HCDKL-HVDP-ENF--RL---LGNVLV-CVL-AH---HFGKEFTP-PV-QA------A-YQKV--VAGVANAL--AHKY-H', score=72.0, start=0, end=216), Alignment(seqA='MV-LS-PADKTNV--K-A--A-WGKVGAHAG---EY-GA-EALE-RMFLSF----PTTK-TY--FPHFDL-SH-G--S---AQVK-G------HGKKV--A--DA-LTNAVAHVD-DMPNALS----A-LSD-LHAH--KLR-VDPV-NFKL-LSHCL---LVT--LAAHLPA----EFTPA-VH-ASLDKFLAS---VSTV------LTS--KYR-', seqB='MVHL-TP------EEKSAVTALWGKV-----NVDE-VG-GEAL-GR--L--LVVYP---WT-QRF--F--ES-FGDLSTPDA-V-MGNPKVKAHGKKVLGAFSD-GL----AH--LD--N-L-KGTFATLS-EL--HCDKL-HVDP-ENF--RL---LGNVLV-CVL-AH---HFGKEFTP-PV-QA------A-YQKV--VAGVANAL--AHKY-H', score=72.0, start=0, end=217), Alignment(seqA='MV-LSPADKTNV--K-A--A-WGKVGAHAG---EY-GA-EALE-RMFLSF----PTTK-TY--FPHFDL-SH-G--S---AQVK-G------HGKKV--A--DA-LTNAVAHVD-DMPNALS----A-LSD-LHAH--KLR-VDPV-NFKL-LSHCL---LVT--LAAHLPA----EFTPA-VH-ASLDKFLAS---VSTV------LTS--KYR-', seqB='MVHLTP------EEKSAVTALWGKV-----NVDE-VG-GEAL-GR--L--LVVYP---WT-QRF--F--ES-FGDLSTPDA-V-MGNPKVKAHGKKVLGAFSD-GL----AH--LD--N-L-KGTFATLS-EL--HCDKL-HVDP-ENF--RL---LGNVLV-CVL-AH---HFGKEFTP-PV-QA------A-YQKV--VAGVANAL--AHKY-H', score=72.0, start=0, end=216), Alignment(seqA='MV-LSPADKTN-V-K-A--A-WGKVGAHAG---EY-GA-EALE-RMFLSF----PTTK-TY--FPHFDL-SH-G--S---AQVK-G------HGKKV--A--DA-LTNAVAHVD-DMPNALS----A-LSD-LHAH--KLR-VDPV-NFKL-LSHCL---LVT--LAAHLPA----EFTPA-VH-ASLDKFLAS---VSTV------LTS--KYR-', seqB='MVHL-----T-PEEKSAVTALWGKV-----NVDE-VG-GEAL-GR--L--LVVYP---WT-QRF--F--ES-FGDLSTPDA-V-MGNPKVKAHGKKVLGAFSD-GL----AH--LD--N-L-KGTFATLS-EL--HCDKL-HVDP-ENF--RL---LGNVLV-CVL-AH---HFGKEFTP-PV-QA------A-YQKV--VAGVANAL--AHKY-H', score=72.0, start=0, end=216), Alignment(seqA='MV-LSPADKTNV-K-A--A-WGKVGAHAG---EY-GA-EALE-RMFLSF----PTTK-TY--FPHFDL-SH-G--S---AQVK-G------HGKKV--A--DA-LTNAVAHVD-DMPNALS----A-LSD-LHAH--KLR-VDPV-NFKL-LSHCL---LVT--LAAHLPA----EFTPA-VH-ASLDKFLAS---VSTV------LTS--KYR-', seqB='MVHL-----TPEEKSAVTALWGKV-----NVDE-VG-GEAL-GR--L--LVVYP---WT-QRF--F--ES-FGDLSTPDA-V-MGNPKVKAHGKKVLGAFSD-GL----AH--LD--N-L-KGTFATLS-EL--HCDKL-HVDP-ENF--RL---LGNVLV-CVL-AH---HFGKEFTP-PV-QA------A-YQKV--VAGVANAL--AHKY-H', score=72.0, start=0, end=215), Alignment(seqA='MV-LS-PADKTNV-K-A--A-WGKVGAHAG---EY-GA-EALE-RMFLSF----PTTK-TY--FPHFDL-SH-G--S---AQVK-G------HGKKV--A--DA-LTNAVAHVD-DMPNALS----A-LSD-LHAH--KLR-VDPV-NFKL-LSHCL---LVT--LAAHLPA----EFTPA-VH-ASLDKFLAS---VSTV------LTS--KYR-', seqB='MVHL-TP-----EEKSAVTALWGKV-----NVDE-VG-GEAL-GR--L--LVVYP---WT-QRF--F--ES-FGDLSTPDA-V-MGNPKVKAHGKKVLGAFSD-GL----AH--LD--N-L-KGTFATLS-EL--HCDKL-HVDP-ENF--RL---LGNVLV-CVL-AH---HFGKEFTP-PV-QA------A-YQKV--VAGVANAL--AHKY-H', score=72.0, start=0, end=216), Alignment(seqA='MV-LSPADKTNV-K-A--A-WGKVGAHAG---EY-GA-EALE-RMFLSF----PTTK-TY--FPHFDL-SH-G--S---AQVK-G------HGKKV--A--DA-LTNAVAHVD-DMPNALS----A-LSD-LHAH--KLR-VDPV-NFKL-LSHCL---LVT--LAAHLPA----EFTPA-VH-ASLDKFLAS---VSTV------LTS--KYR-', seqB='MVHLTP-----EEKSAVTALWGKV-----NVDE-VG-GEAL-GR--L--LVVYP---WT-QRF--F--ES-FGDLSTPDA-V-MGNPKVKAHGKKVLGAFSD-GL----AH--LD--N-L-KGTFATLS-EL--HCDKL-HVDP-ENF--RL---LGNVLV-CVL-AH---HFGKEFTP-PV-QA------A-YQKV--VAGVANAL--AHKY-H', score=72.0, start=0, end=215), Alignment(seqA='MV-LSPADKT-NV-K-A--A-WGKVGAHAG---EY-GA-EALE-RMFLSF----PTTK-TY--FPHFDL-SH-G--S---AQVK-G------HGKKV--A--DA-LTNAVAHVD-DMPNALS----A-LSD-LHAH--KLR-VDPV-NFKL-LSHCL---LVT--LAAHLPA----EFTPA-VH-ASLDKFLAS---VSTV------LTS--KYR-', seqB='MVHL-----TPE-EKSAVTALWGKV-----NVDE-VG-GEAL-GR--L--LVVYP---WT-QRF--F--ES-FGDLSTPDA-V-MGNPKVKAHGKKVLGAFSD-GL----AH--LD--N-L-KGTFATLS-EL--HCDKL-HVDP-ENF--RL---LGNVLV-CVL-AH---HFGKEFTP-PV-QA------A-YQKV--VAGVANAL--AHKY-H', score=72.0, start=0, end=216), Alignment(seqA='MV-LS-PADKTNV-K-A--A-WGKVGAHAG---EY-GA-EALE-RMFLSF----PTTK-TY--FPHFDL-SH-

뭔진 모르겠지만 제가 잘못했습니다. (저거 len으로 보면 길이 장난없다)

 

print(pairwise2.format_alignment(*alignments[0]))

이렇게 쳐 주면 비로소 우리가 생각하는(?)

MV-LSPADKTNV---K-A--A-WGKVGAHAG---EY-GA-EALE-RMFLSF----PTTK-TY--FPHFDL-SH-G--S---AQVK-G------HGKKV--A--DA-LTNAVAHVD-DMPNALS----A-LSD-LHAH--KLR-VDPV-NFKL-LSHCL---LVT--LAAHLPA----EFTPA-VH-ASLDKFLAS---VSTV------LTS--KYR-
|| |     |     | |  | ||||        |  |  |||  |  |      |    |   |  |   |  |  |   | |  |      |||||  |  |  |    ||   |  | |     | ||  |  |  ||  |||  ||   |   |   ||   | ||       ||||  |  |      |    |  |      |    ||  
MVHL-----T--PEEKSAVTALWGKV-----NVDE-VG-GEAL-GR--L--LVVYP---WT-QRF--F--ES-FGDLSTPDA-V-MGNPKVKAHGKKVLGAFSD-GL----AH--LD--N-L-KGTFATLS-EL--HCDKL-HVDP-ENF--RL---LGNVLV-CVL-AH---HFGKEFTP-PV-QA------A-YQKV--VAGVANAL--AHKY-H
  Score=72

이 결과가 나온다. BLAST에서도 이런 식으로 나온다. 

 

BLOSUM62와 함께 정렬...아니고 줄을 

from Bio import pairwise2
from Bio import SeqIO
from Bio.Align import substitution_matrices
blosum62 = substitution_matrices.load("BLOSUM62")
seq1 = SeqIO.read("/home/koreanraichu/alpha.faa", "fasta")
seq2 = SeqIO.read("/home/koreanraichu/beta.faa", "fasta")
alignments = pairwise2.align.globalds(seq1.seq, seq2.seq,blosum62, -10, -0.5)
print(pairwise2.format_alignment(*alignments[0]))

이거 [0] 빼면요? 뭔진 모르겠지만 잘못했어요 소리가 절로 나온다. (길이 장난없음) 

 

MV-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS-----HGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
|| |.|..|..|.|.||||  ...|.|.|||.|.....|.|...|..| |||     .|...||.|||||..|.....||.|........||.||..||.|||.||.||...|...||.|...||||.|.|...|..|.|...|..||.
MVHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
  Score=292.5

.globalds는 전역(전체 시퀀스)를 보는거고 .localds로 부분만 볼 수도 있다. 

 

alignments = pairwise2.align.localds("LSPADKTNVKAA", "PEEKSAV", blosum62, -10, -1)
3 PADKTNV
  |..|..|
1 PEEKSAV
  Score=16

이런 식으로 부분부분 볼 수도 있다. 엥? 근데 저렇게 해 놓으면 어딘지 어떻게 알아요? 

 

print(pairwise2.format_alignment(*alignments[0],full_sequences=True))
LSPADKTNVKAA
  |..|..|   
--PEEKSAV---

개발자는 다 계획이 있었다. 

 

Pairwise aligner

from Bio import Align
aligner = Align.PairwiseAligner()
seq1="GAATTC"
seq2="GCATTC"
score = aligner.score(seq1, seq2)
print(score)
5.0

일단 모셔보래서 모셔봤습니다. 이거는 score 뽑는거고 

from Bio import Align
aligner = Align.PairwiseAligner() # 그래서 모셔왔습니다 
seq1="GAATTC"
seq2="GCATTC"
alignments = aligner.align(seq1, seq2)
for alignment in alignments:
    print(alignment)

이렇게 치면 

G-AATTC
|-|-|||
GCA-TTC

GA-ATTC
|--||||
G-CATTC

G-AATTC
|--||||
GC-ATTC

GAATTC
|.||||
GCATTC

이렇게 나온다. 전 맨 밑에껄로 보여주세요. 픽 안되나

 

GAATTC
||||||
GAATTC

100% 일치하는 시퀀스는 이런 식으로 보여준다. aligner.mode="local"을 주면 위에것처럼 중구난방으로 안 뽑아주고 

GAATTC
 |||||
CAATTC

이런 식으로 뽑아준다. 

 

Wildcard

일반적으로 검색할 때(구글이나 파일같은 거) *이 와일드카드지만, 여기서는 와일드카드를 지정할 수 있다. 

# wildcard를 지정한 다음 쓸 수도 있다
aligner.wildcard = "?"
seq3="?GCC"
seq4="GGCC"
alignments2 = aligner.align(seq3, seq4)
for alignment in alignments2:
    print(alignment)
?G-CC
 |-||
 GGCC

?GCC
 |||
GGCC

여기서는 일단 ?로 지정해보자. 

Lv. 35 라이츄

Lv. 35 라이츄

광고 매크로 없는 청정한 블로그를 위해 노력중입니다. 근데 나만 노력하는 것 같음… ㅡㅡ

Biopython SeqIO 써보기

Coding/Python 2022. 8. 20. 02:43

파이참 쓰다보니 키보드 먹통돼서 재부팅 여러번 했다. 이것도 갈 때 된 듯... 


*참고로 그냥 리드나 파싱으로 긁어오는 방법은 첫 글에서 썼으므로 생략한다. 

 

Iteration 활용하기

근데 들어도 이건 뭔 기능인가 싶긴 함. 

 

from Bio import SeqIO
identifiers = [seq_record.id for seq_record in SeqIO.parse("/home/koreanraichu/sequence.gb", "genbank")]
print(identifiers)

뭐야 wrap 넣어줘요... 

레코드와 for문을 조합해서 원하는 데이터(ID나 시퀀스, description)만 가져올 수 있다. 

from Bio import SeqIO
identifiers = [seq_record.id for seq_record in SeqIO.parse("/home/koreanraichu/ls_orchid.gbk", "genbank")]
print(identifiers)
['Z78533.1', 'Z78532.1', 'Z78531.1', 'Z78530.1', 'Z78529.1', 'Z78527.1', 'Z78526.1', 'Z78525.1', 'Z78524.1', 'Z78523.1', 'Z78522.1', 'Z78521.1', 'Z78520.1', 'Z78519.1', 'Z78518.1', 'Z78517.1', 'Z78516.1', 'Z78515.1', 'Z78514.1', 'Z78513.1', 'Z78512.1', 'Z78511.1', 'Z78510.1', 'Z78509.1', 'Z78508.1', 'Z78507.1', 'Z78506.1', 'Z78505.1', 'Z78504.1', 'Z78503.1', 'Z78502.1', 'Z78501.1', 'Z78500.1', 'Z78499.1', 'Z78498.1', 'Z78497.1', 'Z78496.1', 'Z78495.1', 'Z78494.1', 'Z78493.1', 'Z78492.1', 'Z78491.1', 'Z78490.1', 'Z78489.1', 'Z78488.1', 'Z78487.1', 'Z78486.1', 'Z78485.1', 'Z78484.1', 'Z78483.1', 'Z78482.1', 'Z78481.1', 'Z78480.1', 'Z78479.1', 'Z78478.1', 'Z78477.1', 'Z78476.1', 'Z78475.1', 'Z78474.1', 'Z78473.1', 'Z78472.1', 'Z78471.1', 'Z78470.1', 'Z78469.1', 'Z78468.1', 'Z78467.1', 'Z78466.1', 'Z78465.1', 'Z78464.1', 'Z78463.1', 'Z78462.1', 'Z78461.1', 'Z78460.1', 'Z78459.1', 'Z78458.1', 'Z78457.1', 'Z78456.1', 'Z78455.1', 'Z78454.1', 'Z78453.1', 'Z78452.1', 'Z78451.1', 'Z78450.1', 'Z78449.1', 'Z78448.1', 'Z78447.1', 'Z78446.1', 'Z78445.1', 'Z78444.1', 'Z78443.1', 'Z78442.1', 'Z78441.1', 'Z78440.1', 'Z78439.1']

예제에 있단 파일로 돌렸더니 94개 나온다. 근데 94개가 저렇게 중구난방으로 나오니까 힘들다... 이거 넘파이 배열로 바꿔서 깔쌈하게 보여주면 안되나? 

from Bio import SeqIO
import numpy as np
identifiers = [seq_record.id for seq_record in SeqIO.parse("/home/koreanraichu/ls_orchid.gbk", "genbank")]
identifiers=np.array(identifiers)
identifiers=identifiers.reshape(2,47)
print(identifiers)

그래서 해봤다. 

[['Z78533.1' 'Z78532.1' 'Z78531.1' 'Z78530.1' 'Z78529.1' 'Z78527.1'
  'Z78526.1' 'Z78525.1' 'Z78524.1' 'Z78523.1' 'Z78522.1' 'Z78521.1'
  'Z78520.1' 'Z78519.1' 'Z78518.1' 'Z78517.1' 'Z78516.1' 'Z78515.1'
  'Z78514.1' 'Z78513.1' 'Z78512.1' 'Z78511.1' 'Z78510.1' 'Z78509.1'
  'Z78508.1' 'Z78507.1' 'Z78506.1' 'Z78505.1' 'Z78504.1' 'Z78503.1'
  'Z78502.1' 'Z78501.1' 'Z78500.1' 'Z78499.1' 'Z78498.1' 'Z78497.1'
  'Z78496.1' 'Z78495.1' 'Z78494.1' 'Z78493.1' 'Z78492.1' 'Z78491.1'
  'Z78490.1' 'Z78489.1' 'Z78488.1' 'Z78487.1' 'Z78486.1']
 ['Z78485.1' 'Z78484.1' 'Z78483.1' 'Z78482.1' 'Z78481.1' 'Z78480.1'
  'Z78479.1' 'Z78478.1' 'Z78477.1' 'Z78476.1' 'Z78475.1' 'Z78474.1'
  'Z78473.1' 'Z78472.1' 'Z78471.1' 'Z78470.1' 'Z78469.1' 'Z78468.1'
  'Z78467.1' 'Z78466.1' 'Z78465.1' 'Z78464.1' 'Z78463.1' 'Z78462.1'
  'Z78461.1' 'Z78460.1' 'Z78459.1' 'Z78458.1' 'Z78457.1' 'Z78456.1'
  'Z78455.1' 'Z78454.1' 'Z78453.1' 'Z78452.1' 'Z78451.1' 'Z78450.1'
  'Z78449.1' 'Z78448.1' 'Z78447.1' 'Z78446.1' 'Z78445.1' 'Z78444.1'
  'Z78443.1' 'Z78442.1' 'Z78441.1' 'Z78440.1' 'Z78439.1']]

참고로 numpy array는 5*6을 만들려면 진짜 데이터가 30개 있어야된다. 즉, 저게 최선이다. 

 

레코드 반복

이것도 Iteration을 이용한다. 

from Bio import SeqIO
record_iterator = SeqIO.parse("/home/koreanraichu/rcsb_pdb_2B10.fasta", "fasta")

first_record = next(record_iterator)
print(first_record.id)
print(first_record.description)

second_record = next(record_iterator)
print(second_record.id)
print(second_record.description)
from Bio import SeqIO
first_record_2 = next(SeqIO.parse("/home/koreanraichu/rcsb_pdb_2B10.fasta", "fasta"))
print(first_record_2)

위 코드와 아래 코드 둘 다 iteration을 활용하는 건 맞다. 

2B10_1|Chains
2B10_1|Chains A, C|Cytochrome c peroxidase, mitochondrial|Saccharomyces cerevisiae (4932)
2B10_2|Chains
2B10_2|Chains B, D|Cytochrome c iso-1|Saccharomyces cerevisiae (4932)
ID: 2B10_1|Chains
Name: 2B10_1|Chains
Description: 2B10_1|Chains A, C|Cytochrome c peroxidase, mitochondrial|Saccharomyces cerevisiae (4932)
Number of features: 0
Seq('TTPLVHVASVEKGRSYEDFQKVYNAIALKLREDDEYDNYIGYGPVLVRLAWHTS...QGL')

위 코드는 next 수가 두 개라 앞에서 두 데이터+원하는 부분만 띄운 거고 아래는 판다스의 head()마냥 맨 위에걸 띄운거다. 아니 그러면 next를 일일이 쳐야되나요? for문 안됨? 있어봐요 해봅시다. 

 

from Bio import SeqIO
record_iterator = SeqIO.parse("/home/koreanraichu/ls_orchid.fasta", "fasta")
for i in range(0,4):
    record = next(record_iterator)
    print(record.id)
    print(record.description)
gi|2765658|emb|Z78533.1|CIZ78533
gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
gi|2765657|emb|Z78532.1|CCZ78532
gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA
gi|2765656|emb|Z78531.1|CFZ78531
gi|2765656|emb|Z78531.1|CFZ78531 C.fasciculatum 5.8S rRNA gene and ITS1 and ITS2 DNA
gi|2765655|emb|Z78530.1|CMZ78530
gi|2765655|emb|Z78530.1|CMZ78530 C.margaritaceum 5.8S rRNA gene and ITS1 and ITS2 DNA

되는데요? 

from Bio import SeqIO
record_iterator = SeqIO.parse("/home/koreanraichu/ls_orchid.fasta", "fasta")

i=0
while i<4
    record = next(record_iterator)
    print(record.id)
    print(record.description)
    i=i+1

참고로 While문도 된다. 

 

레코드 목록 뽑기

from Bio import SeqIO
records = list(SeqIO.parse("/home/koreanraichu/sequence.gb", "genbank"))
print(records)

이게 무옵션 기본 코드이다. 

print("First record:",records[0])
print("Last record:",records[-1])

인덱싱도 된다. 

 

first=records[0]
last=records[-1]
# 변수 안 주면 풀로 나온다...
print("First record ID:",first.id)
print("Last record ID:",last.id)

인덱싱 하면서 특정 인자만 보는 것도 된다. (위 코드는 처음과 맨 마지막 데이터의 아이디를 보여주는 코드)

 

데이터 추출

from Bio import SeqIO
record_iterator = SeqIO.parse("/home/koreanraichu/ls_orchid.gbk", "genbank")
first_record = next(record_iterator)
# 첫 번째 레코드를 가져온다.
print(first_record.annotations) # Annotations
print(first_record.annotations.keys()) # Key
print(first_record.annotations.values()) # Values
{'molecule_type': 'DNA', 'topology': 'linear', 'data_file_division': 'PLN', 'date': '30-NOV-2006', 'accessions': ['Z78533'], 'sequence_version': 1, 'gi': '2765658', 'keywords': ['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2'], 'source': 'Cypripedium irapeanum', 'organism': 'Cypripedium irapeanum', 'taxonomy': ['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliophyta', 'Liliopsida', 'Asparagales', 'Orchidaceae', 'Cypripedioideae', 'Cypripedium'], 'references': [Reference(title='Phylogenetics of the slipper orchids (Cypripedioideae: Orchidaceae): nuclear rDNA ITS sequences', ...), Reference(title='Direct Submission', ...)]} 
# 그냥 주석을 출력했을 때
dict_keys(['molecule_type', 'topology', 'data_file_division', 'date', 'accessions', 'sequence_version', 'gi', 'keywords', 'source', 'organism', 'taxonomy', 'references'])
# 키 주세요
dict_values(['DNA', 'linear', 'PLN', '30-NOV-2006', ['Z78533'], 1, '2765658', ['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2'], 'Cypripedium irapeanum', 'Cypripedium irapeanum', ['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliophyta', 'Liliopsida', 'Asparagales', 'Orchidaceae', 'Cypripedioideae', 'Cypripedium'], [Reference(title='Phylogenetics of the slipper orchids (Cypripedioideae: Orchidaceae): nuclear rDNA ITS sequences', ...), Reference(title='Direct Submission', ...)]])
# 값 주세요

Annotation은 저장 형태가 딕셔너리라 이렇게 나온다고... 딕셔너리 형태라는 건 Key:value로 저장되어 있기 때문에 키값이나 밸류값만 볼 수 있다는 얘기이기도 하다. 

 

print(first_record.annotations["source"]) # Annotation+selector(키 중 하나를 선택)
Cypripedium irapeanum # 결과

이런 식으로 키 중 하나를 선택해서 거기에 대한 밸류를 볼 수 있다. 

 

# 아래 코드는 리스트업 코드이다.
all_species = []
for seq_record in record_iterator:
    all_species.append(seq_record.annotations["organism"])
print(all_species)

숫자만 맞으면 어레이 만들어도 된다. 

all_species = [
    seq_record.annotations["organism"]
    for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank")
]
print(all_species)

for문에 던져버려도 되고. 

 

수정

from Bio import SeqIO
record_iterator = SeqIO.parse("/home/koreanraichu/rcsb_pdb_2B10_mod.fasta", "fasta") # 불러와서(사본 만듦)
first_record = next(record_iterator)
first_record.id="SC_2B10_1" # ID를 바꿨다.
print(first_record.id)

원본 파일에 손대기 좀 그래서 사본 만들어서 수정했다. 수정은 이런 식으로 하면 된다. 

 

웹에서 받아오기

아 솔직치 다운받기 귀찮은데 아이디 알면 걍 웹에서 뜯어오면 안돼요? 됨. 

 

파일 주세요 Genbank여... 

from Bio import Entrez
from Bio import SeqIO
Entrez.email = "blackholekun@gmail.com"
with Entrez.efetch(
    db="nucleotide", rettype="fasta", retmode="text", id="60391722"
) as handle:
    seq_record = SeqIO.read(handle, "fasta")
print(seq_record)

순순히 파일을 내놓는다면 유혈사태는 면할 것이다! (메일 내꺼임)

ID: AB018076.3
Name: AB018076.3
Description: AB018076.3 Homo sapiens hedgehog gene, complete cds
Number of features: 0
Seq('ATGTCTCCCGCCCGGCTCCGGCCCCGACTGCACTTCTGCCTGGTCCTGTTGCTG...ACC')

그래서 받아옴

 

from Bio import Entrez
from Bio import SeqIO
Entrez.email = "blackholekun@gmail.com"
with Entrez.efetch(
    db="nucleotide", rettype="gb", retmode="text", id="60391722"
) as handle:
    seq_record = SeqIO.read(handle, "gb")
print(seq_record)

파일 또 내놔! 

 

ID: AB018076.3
Name: AB018076
Description: Homo sapiens hedgehog gene, complete cds
Number of features: 8
/molecule_type=DNA
/topology=linear
/data_file_division=PRI
/date=24-FEB-2005
/accessions=['AB018076', 'AB010092', 'AB018075']
/sequence_version=3
/keywords=['']
/source=Homo sapiens (human)
/organism=Homo sapiens
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']
/references=[Reference(title='Expression of Sonic hedgehog and its receptor Patched/Smoothened in human cancer cell lines and embryonic organs', ...), Reference(title='Direct Submission', ...)]
/comment=On or before Mar 1, 2005 this sequence version replaced AB010092.1,
AB018075.1, AB018076.2.
Seq('ATGTCTCCCGCCCGGCTCCGGCCCCGACTGCACTTCTGCCTGGTCCTGTTGCTG...ACC')

아, 이건 genbank 포맷이다. 파스타랑 저거랑 다 된다. 단백질 이름이요? 아니 진짜 있는건데??? 

 

from Bio import Entrez
from Bio import SeqIO
Entrez.email = "blackholekun@gmail.com"
with Entrez.efetch(
    db="nucleotide", rettype="gb", retmode="text", id="60391722,60391706,1519245148" # 여러개도 된다.
) as handle:
    for seq_record in SeqIO.parse(handle, "gb"):
        print(seq_record.description)
Homo sapiens hedgehog gene, complete cds
Homo sapiens hedgehog gene, complete cds
Homo sapiens sonic hedgehog signaling molecule (SHH), transcript variant 1, mRNA

아 물론 뭉텅이로 받는것도 된다. 

 

파일 주세요 swissprot이여... 

솔직히 이정도면 PDB랑 유니프롯이랑 TAIR랑 펍켐도 지원해줘야된다. 개발자 일해라. 

 

from Bio import ExPASy
from Bio import SeqIO
with ExPASy.get_sprot_raw("O23729") as handle:
    seq_record = SeqIO.read(handle, "swiss")
print(seq_record)

스위스프롯에서는 이렇게 가져온다. 

ID: O23729
Name: CHS3_BROFI
Description: RecName: Full=Chalcone synthase 3; EC=2.3.1.74; AltName: Full=Naringenin-chalcone synthase 3;
Database cross-references: EMBL:AF007097, SMR:O23729, PRIDE:O23729, UniPathway:UPA00154, GO:GO:0016210, GO:GO:0009813, Gene3D:3.40.47.10, InterPro:IPR012328, InterPro:IPR001099, InterPro:IPR018088, InterPro:IPR011141, InterPro:IPR016039, PANTHER:PTHR11877, Pfam:PF02797, Pfam:PF00195, PIRSF:PIRSF000451, SUPFAM:SSF53901, PROSITE:PS00441
Number of features: 2
/molecule_type=protein
/accessions=['O23729']
/protein_existence=2
/date=15-JUL-1999
/sequence_version=1
/date_last_sequence_update=01-JAN-1998
/date_last_annotation_update=10-FEB-2021
/entry_version=75
/gene_name=Name=CHS3;
/organism=Bromheadia finlaysoniana (Orchid)
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', 'Tracheophyta', 'Spermatophyta', 'Magnoliopsida', 'Liliopsida', 'Asparagales', 'Orchidaceae', 'Epidendroideae', 'Vandeae', 'Adrorhizinae', 'Bromheadia']
/ncbi_taxid=['41205']
/comment=FUNCTION: The primary product of this enzyme is 4,2',4',6'- tetrahydroxychalcone (also termed naringenin-chalcone or chalcone) which can under specific conditions spontaneously isomerize into naringenin.
CATALYTIC ACTIVITY: Reaction=4-coumaroyl-CoA + 2 H(+) + 3 malonyl-CoA = 2',4,4',6'-   tetrahydroxychalcone + 3 CO2 + 4 CoA; Xref=Rhea:RHEA:11128,   ChEBI:CHEBI:15378, ChEBI:CHEBI:16526, ChEBI:CHEBI:57287,   ChEBI:CHEBI:57355, ChEBI:CHEBI:57384, ChEBI:CHEBI:77645; EC=2.3.1.74;   Evidence={ECO:0000255|PROSITE-ProRule:PRU10023};
PATHWAY: Secondary metabolite biosynthesis; flavonoid biosynthesis.
SIMILARITY: Belongs to the thiolase-like superfamily. Chalcone/stilbene synthases family. {ECO:0000305}.
/references=[Reference(title='Molecular cloning and sequence analysis of chalcone synthase cDNAs of Bromheadia finlaysoniana.', ...)]
/keywords=['Acyltransferase', 'Flavonoid biosynthesis', 'Transferase']
Seq('MAPAMEEIRQAQRAEGPAAVLAIGTSTPPNALYQADYPDYYFRITKSEHLTELK...GAE')

뭔진 모르겠지만 일단 잘못했어요

참고로 저거 치니까 유니프롯(Uniprot) 나오길래 여기서 아이디 제공해주는건가 했는데 저기다가 유니프롯 아이디 넣으니까 됨. 

 

from Bio import ExPASy
from Bio import SeqIO
with ExPASy.get_sprot_raw("Q14005") as handle:
    seq_record = SeqIO.read(handle, "swiss")
print(seq_record.name)
print(seq_record.description)

Q14005: Interleukin-16

 

IL16_HUMAN
RecName: Full=Pro-interleukin-16; Contains: RecName: Full=Interleukin-16; Short=IL-16; AltName: Full=Lymphocyte chemoattractant factor; Short=LCF;

뭐야 왜 돼요 ㄷㄷ 

 

딕셔너리화하기

뭐 하면 뭐 좋은갑지... 모르것다... (뇌클리어)

 

SeqIO.to_dict

from Bio import SeqIO
orchid_dict = SeqIO.to_dict(SeqIO.parse("/home/koreanraichu/sequence.gb", "genbank"))
print(orchid_dict.keys())
print(orchid_dict.values())
dict_keys(['NM_000517.6'])
dict_values([SeqRecord(seq=Seq('ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGC...GCA'), id='NM_000517.6', name='NM_000517', description='Homo sapiens hemoglobin subunit alpha 2 (HBA2), mRNA', dbxrefs=[])])

이런 식으로 딕셔너리화할 수 있고 

 

print(orchid_dict['2B10_1|Chains'].description)
2B10_1|Chains A, C|Cytochrome c peroxidase, mitochondrial|Saccharomyces cerevisiae (4932)

이런 식으로 셀렉터 적용도 된다. 

 

for record in SeqIO.parse("/home/koreanraichu/rcsb_pdb_2B10.fasta", "fasta"):
    print(record.id, seguid(record.seq))

SEGUID checksum 함수를 적용시켜 딕셔너리화 할 수도 있다는데... 

2B10_1|Chains kSIi2w+CcP3y/k3bFRVhvvWtCtM
2B10_2|Chains SA4dH3bOSXcG43088Houu01TH2U

이건 키가 일단 너무 저세상이다. InChl key 못지 않음... 

seguid_dict = SeqIO.to_dict(SeqIO.parse("/home/koreanraichu/rcsb_pdb_2B10.fasta", "fasta"),lambda rec: seguid(rec.seq))
record = seguid_dict["kSIi2w+CcP3y/k3bFRVhvvWtCtM"]
print(record.id)
2B10_1|Chains

그러니까 저 함수 적용하고 저 키를 다 외우고 있으면 찾는 게 가능하다는 얘기. 

 

SeqIO.index

from Bio import SeqIO
orchid_dict = SeqIO.index("/home/koreanraichu/Octodon degus insulin.gb", "genbank")
print(orchid_dict.keys())
print(orchid_dict.values())
KeysView(SeqIO.index('/home/koreanraichu/Octodon degus insulin.gb', 'genbank', alphabet=None, key_function=None))
ValuesView(SeqIO.index('/home/koreanraichu/Octodon degus insulin.gb', 'genbank', alphabet=None, key_function=None))

예제 보면 리스트로 잘만 나오던데 나는 왜 저따구여? 

 

쓰기

언제까지 긁어만 올텐가... 

 

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

rec1 = SeqRecord(
    Seq(
        "MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD"
        "GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK"
        "NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM"
        "SSAC",
    ),
    id="gi|14150838|gb|AAK54648.1|AF376133_1",
    description="chalcone synthase [Cucumis sativus]",
)

rec2 = SeqRecord(
    Seq(
        "YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ"
        "DMVVVEIPKLGKEAAVKAIKEWGQ",
    ),
    id="gi|13919613|gb|AAK33142.1|",
    description="chalcone synthase [Fragaria vesca subsp. bracteata]",
)

rec3 = SeqRecord(
    Seq(
        "MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMC"
        "EKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQP"
        "KSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAEN"
        "NKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELV"
        "SAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFW"
        "IAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGT"
        "TGEGLEWGVLFGFGPGLTVETVVLHSVAT",
    ),
    id="gi|13925890|gb|AAK49457.1|",
    description="chalcone synthase [Nicotiana tabacum]",
)

my_records = [rec1, rec2, rec3]

from Bio import SeqIO
SeqIO.write(my_records, "/home/koreanraichu/example.fasta", "fasta")

아니 이거 내가 IL로 트라이했는데 에러났음... 예제껀 잘됐는데 뭔 타입에러가 반기던데? 

아무튼 이런 식으로 저장된다. (메시지 출력 따로 안했다)

 

from Bio import SeqIO

records = SeqIO.parse("/home/koreanraichu/sequence.gb", "genbank")
count = SeqIO.write(records, "/home/koreanraichu/my_example.fasta", "fasta")
print("Converted %i records" % count)

count2 = SeqIO.convert("/home/koreanraichu/sequence.gb", "genbank", "/home/koreanraichu/my_example2.fasta", "fasta")
print("Converted %i records" % count2)

긁어오는 방법은 두 가지고 둘 다 된다. 

from Bio import SeqIO
records = [rec.reverse_complement(id="rc_"+rec.id, description = "reverse complement")
           for rec in SeqIO.parse("/home/koreanraichu/at5g40780.gb", "genbank")]
SeqIO.write(records, "/home/koreanraichu/at5g40780_rc.fasta", "fasta")

DNA의 경우 reverse complement sequence로 저장하는 것도 된다. (단백질은 하지말자...)

Lv. 35 라이츄

Lv. 35 라이츄

광고 매크로 없는 청정한 블로그를 위해 노력중입니다. 근데 나만 노력하는 것 같음… ㅡㅡ

Biopython으로 시퀀스 레코드 생성하고 만져보기

Coding/Python 2022. 8. 20. 02:30

오늘껀 근데 하면서 나도 이해 못했음... 


SeqRecord를 만들려면 뭘 또 모셔와야 하는데 바로 

from Bio.SeqRecord import SeqRecord

이놈이다. 

이정도면 복잡한 거 돌리려면 아주 모셔오는데만 열댓줄 쓰게 생겼는데??? 텐서플로우나 저기 그 넘파이 판다스처럼 합쳐버리지 그러냐... (걔네는 일단 한 번 데려오면 다 써먹을 수 있음) 


레코드 생성하기

이모지 넣고싶은데 이거 리눅스라 이모지 넣는 법을 모름... 아무튼 레코드는 이렇게 생성하면 된다. 

from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
EcoRI=Seq("GAATTC") # ECoRI sequence
ECoRI_r=SeqRecord(EcoRI) #로 레코드를 만들어보겠습니다 
print(EcoRI)
print(ECoRI_r)

그러면 어떻게 나오느냐 

GAATTC # 시퀀스
ID: <unknown id> # 여기서부터 레코드
Name: <unknown name>
Description: <unknown description>
Number of features: 0
Seq('GAATTC')

이렇게 나온다. 뭐가 많이 빠졌지만 일단 레코드 맞다. 여기다 넣고싶으면 일단 다음 문단으로 가시고... 아니 그럼 생성할 때 넣을 수 있나요? 예. 됩니다. 

HaeIII_r=SeqRecord(HaeIII,id="RES_HAEIII",name="HaeIII restriction site",description="Restriction of HaeIII, cuts GG/CC")

노션도 되는 wrap이 안되는 코드블록... 네이버 니네 점검하면 대체 뭐하냐... 아무튼 이렇게 만들 때 넣으면 

ID: RES_HAEIII
Name: HaeIII restriction site
Description: Restriction of HaeIII, cuts GG/CC
Number of features: 0
Seq('GGCC')

이렇게 생성된다. 

 

레코드에 뭔가 넣기

아니 아까 만든거에 뭐 좀 어떻게 넣어주면 안됩니까? 아뇨 됩니다. 

# 레코드에 정보를 추가해보자
ECoRI_r.id="RES_ECORI"
ECoRI_r.name="ECoRI restriction site"
ECoRI_r.description="Restriction site of ECoRI, cuts G/AATTC"
# ID랑 이름, description 정보를 추가했다.

이렇게 필요한 정보를 넣어줄 수 있다. (참고: EcoRI이 맞는 표기다)

ID: RES_ECORI
Name: ECoRI restriction site
Description: Restriction site of ECoRI, cuts G/AATTC
Number of features: 0
Seq('GAATTC')

제가 된다고 했잖아여... 

HaeIII_r.annotations["notes"]="갑자기 해삼이 먹고싶다. " # annotation
HaeIII_r.letter_annotations['letter annotations']=['G','G/','C','C'] # letter annotations
/notes=갑자기 해삼이 먹고싶다. 
Per letter annotation for: letter annotations

주석도 넣을 수 있다. 참고로 HaeIII은 Haemophilus influenzae biogroup aegyptius에서 유래한 제한효소. 본인은 해삼 혹은 해쓰리로 읽는다. (자매품 XbaI...)

 

FASTA에서 긁어오기

FASTA와 Genbank 파일에서 레코드를 긁어올 수도 있다. 후자쪽이 정보는 많음. 

from Bio import SeqIO
record=SeqIO.read('/home/koreanraichu/rcsb_pdb_7PNN.fasta','fasta')
print(record)
record=SeqIO.parse('/home/koreanraichu/rcsb_pdb_2B10.fasta','fasta')

FASTA 파일에 하나 들어있으면 read, 두 개 이상이면 parse로 긁어야 한다. (두 개 이상인데 read로 긁으면 ValueError: More than one record found in handle가 당신을 반길 것이다)

 

print(record.id)
7PNN_1|Chain
for record2 in SeqIO.parse("/home/koreanraichu/rcsb_pdb_2B10.fasta","fasta"):
    print(record2.id)
2B10_1|Chains
2B10_2|Chains

긁는 난이도치고 접근은 되게 쉽다. 근데 parse로 긁어오면 for문 써야 나온다... 

 

Genbank에서 긁어오기

record=SeqIO.read("/home/koreanraichu/sequence.gb",'genbank')
print(record)
print(record.description)
Homo sapiens hemoglobin subunit alpha 2 (HBA2), mRNA

얘는 리드로 파스고 없이 걍 긁어오면 접근이 바로 된다. 편한데? 

 

Feature

시퀀스 레코드의 정보를 담고 있다고 하는데 뭔지 모르겠는 건 기분탓이 아니다. 여기서 볼 것은 .type와 .location, .qualifiers이다. 

 

.type

for feature in record.features:
    print(feature.type)
source
gene
CDS
sig_peptide
mat_peptide
mat_peptide
mat_peptide
regulatory
polyA_site

얘도 for문 돌려야 나온다. 타입은 말 그대로 타입. 

.location

for feature in record.features:
    print(feature.location)
[0:432](+)
[0:432](+)
[41:371](+)
[41:113](+)
[113:200](+)
[206:293](+)
[299:368](+)
[413:419](+)
[431:432](+)

Feature의 위치 정보. +는 뭐임? 

 

.qualifiers

for feature in record.features:
    print(feature.qualifiers)
OrderedDict([('organism', ['Octodon degus']), ('mol_type', ['mRNA']), ('db_xref', ['taxon:10160']), ('tissue_type', ['pancreas'])])
OrderedDict([('gene', ['insulin'])])
OrderedDict([('gene', ['insulin']), ('codon_start', ['1']), ('product', ['insulin']), ('protein_id', ['AAA40590.1']), ('translation', ['MAPWMHLLTVLALLALWGPNSVQAYSSQHLCGSNLVEALYMTCGRSGFYRPHDRRELEDLQVEQAELGLEAGGLQPSALEMILQKRGIVDQCCNNICTFNQLQNYCNVP'])])
OrderedDict([('gene', ['insulin'])])
OrderedDict([('gene', ['insulin']), ('product', ['insulin B-chain'])])
OrderedDict([('gene', ['insulin']), ('product', ['insulin C-peptide'])])
OrderedDict([('gene', ['insulin']), ('product', ['insulin A-chain'])])
OrderedDict([('regulatory_class', ['polyA_signal_sequence']), ('gene', ['insulin'])])
OrderedDict([('gene', ['insulin'])])

Feature를 딕셔너리 형태로 저장해둔 것. Octodon degus는 데덴네의 모티브가 되기도 한 데구의 학명이다. 

 

Position과 location

Feature의 위치 정보. Position은 0차원(점)이고 Location은 1차원(선)이다.

location은 포지션 두 개가 있어야된다. 여부터 여까지라는 얘기니께...

from Bio import SeqFeature
start_pos=SeqFeature.ExactPosition(15)
end_pos=SeqFeature.ExactPosition(30)
location=SeqFeature.FeatureLocation(start_pos, end_pos)
print(start_pos,end_pos,location)
15 30 [15:30]
start_pos2 = SeqFeature.AfterPosition(1)
end_pos2 = SeqFeature.BetweenPosition(9, left=8, right=9)
my_location = SeqFeature.FeatureLocation(start_pos2, end_pos2)
print(start_pos2,end_pos2,my_location)
>1 (8^9) [>1:(8^9)]

30쓰니까 에러뜸... 

 

포지션이 명확할 때도 있고 명확하지 않을 때도 있다. 무슨 양자역학이냐... 슈뢰딩거의 포지션

 

start_pos2 = SeqFeature.AfterPosition(1)
end_pos2 = SeqFeature.BeforePosition(8)
my_location = SeqFeature.FeatureLocation(start_pos2, end_pos2)
print(start_pos2,end_pos2,my_location)
>1 <8 [>1:<8]

참고로 애프터가 있다는 것은 비포가 있다는 얘기이기도 하다. (대충 펀쿨섹좌 짤)

 

실제 파일에 적용해보기

ID: NM_000517.6
Name: NM_000517
Description: Homo sapiens hemoglobin subunit alpha 2 (HBA2), mRNA
Number of features: 44
/molecule_type=mRNA
/topology=linear
/data_file_division=PRI
/date=06-SEP-2021
/accessions=['NM_000517']
/sequence_version=6
/keywords=['RefSeq', 'MANE Select']
/source=Homo sapiens (human)
/organism=Homo sapiens
/taxonomy=['Eukaryota', 'Metazoa', 'Chordata', 'Craniata', 'Vertebrata', 'Euteleostomi', 'Mammalia', 'Eutheria', 'Euarchontoglires', 'Primates', 'Haplorrhini', 'Catarrhini', 'Hominidae', 'Homo']
/references=[Reference(title='Molecular and Hematological Analysis of Alpha- and Beta-Thalassemia in a Cohort of Mexican Patients', ...), Reference(title='Molecular Characterization and Hematological Aspects of Hb E-Myanmar [beta26(B8)Glu-->Lys and beta65(E9)Lys-->Asn, HBB: c.[79G>A;198G>C]): A Novel beta-Thalassemic Hemoglobin Variant', ...), Reference(title='delta-Globin Chain Variants Associated with Decreased Hb A2 Levels: A National Reference Laboratory Experience', ...), Reference(title='Heterozygosity for the Novel HBA2: c.*91_*92delTA Polyadenylation Site Variant on the alpha2-Globin Gene Expanding the Genetic Spectrum of alpha-Thalassemia in Iran', ...), Reference(title='Identification and characterization of two novel and differentially expressed isoforms of human alpha2- and alpha1-globin genes', ...), Reference(title='Alpha-Thalassemia', ...), Reference(title="Cloning and complete nucleotide sequence of human 5'-alpha-globin gene", ...), Reference(title='A new abnormal human hemoglobin: Hb Prato (alpha 2 31 (B12) Arg leads to Ser beta 2)', ...), Reference(title='Hemoglobin Suan-Dok (alpha 2 109 (G16) Leu replaced by Arg beta 2): an unstable variant associated with alpha-thalassemia', ...), Reference(title='Hemoglobin Tarrant: alpha126(H9) Asp leads to Asn. A new hemoglobin variant in the alpha1beta1 contact region showing high oxygen affinity and reduced cooperativity', ...)]
/comment=REVIEWED REFSEQ: This record has been curated by NCBI staff. The
reference sequence was derived from V00493.1, Z84721.1 and
AI016696.1.
This sequence is a reference standard in the RefSeqGene project.
On Aug 3, 2018 this sequence version replaced NM_000517.5.
Summary: The human alpha globin gene cluster located on chromosome
16 spans about 30 kb and includes seven loci: 5'- zeta - pseudozeta
- mu - pseudoalpha-1 - alpha-2 - alpha-1 - theta - 3'. The alpha-2
(HBA2) and alpha-1 (HBA1) coding sequences are identical. These
genes differ slightly over the 5' untranslated regions and the
introns, but they differ significantly over the 3' untranslated
regions. Two alpha chains plus two beta chains constitute HbA,
which in normal adult life comprises about 97% of the total
hemoglobin; alpha chains combine with delta chains to constitute
HbA-2, which with HbF (fetal hemoglobin) makes up the remaining 3%
of adult hemoglobin. Alpha thalassemias result from deletions of
each of the alpha genes as well as deletions of both HBA2 and HBA1;
some nondeletion alpha thalassemias have also been reported.
[provided by RefSeq, Jul 2008].
Publication Note:  This RefSeq record includes a subset of the
publications that are available for this gene. Please see the Gene
record to access additional publications.
COMPLETENESS: full length.
/structured_comment=OrderedDict([('Evidence-Data', OrderedDict([('Transcript exon combination', 'AI816040.1, AI815806.1 [ECO:0000332]'), ('RNAseq introns', 'single sample supports all introns SAMEA1965299, SAMEA1966682 [ECO:0000348]')])), ('RefSeq-Attributes', OrderedDict([('MANE Ensembl match', 'ENST00000251595.11/ ENSP00000251595.6'), ('RefSeq Select criteria', 'based on single protein-coding transcript')]))])
Seq('ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGC...GCA')

오늘의 시범조교. 뭐야 wrap 돌려줘요.

 

record=SeqIO.read("/home/koreanraichu/sequence.gb",'genbank') # 여기서 뭘 뽑아봅시다
index=575
for feature in record.features:
    if index in feature:
        print("%s, %s, %s" %(feature.type,feature.location,feature.qualifiers.get("db_xref")))

여기서 575가 들어가는 feature를 찾아보자. 

 

source, [0:576](+), ['taxon:9606']
gene, [0:576](+), ['GeneID:3040', 'HGNC:HGNC:4824', 'MIM:141850']
exon, [337:576](+), None

아 그래서 이렇게 나온건가... (아직도 이해 못함) 

 

SeqFeature로 시퀀스의 특정 부분만 잘라서 보기

from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation
at5g40780=Seq("MVAQAPHDDHQDDEKLAAARQKEIEDWLPITSSRNAKWWYSAFHNVTAMVGAGVLGLPYAMSQLGWGPGIAVLVLSWVITLYTLWQMVEMHEMVPGKRFDRYHELGQHAFGEKLGLYIVVPQQLIVEIGVCIVYMVTGGKSLKKFHELVCDDCKPIKLTYFIMIFASVHFVLSHLPNFNSISGVSLAAAVMSLSYSTIAWASSASKGVQEDVQYGYKAKTTAGTVFNFFSGLGDVAFAYAGHNVVLEIQATIPSTPEKPSKGPMWRGVIVAYIVVALCYFPVALVGYYIFGNGVEDNILMSLKKPAWLIATANIFVVIHVIGSYQIYAMPVFDMMETLLVKKLNFRPTTTLRFFVRNFYVAATMFVGMTFPFFGGLLAFFGGFAFAPTTYFLPCVIWLAIYKPKKYSLSWWANWVCIVFGLFLMVLSPIGGLRTIVIQAKGYKFYS")
# 단백질 시퀀스 왜 이렇게 길어 이거
feature = SeqFeature(FeatureLocation(0, 30), type="protein", strand=1)
feature_seq = at5g40780[feature.location.start:feature.location.end]
print(feature_seq)

TAIR에서 돌아온 시범조교, LHT1의 단백질 시퀀스다. 뭐요. 파이참 wrap 일일이 켜기도 귀찮은데 걍 혀... 아무튼 feature라는 변수를 통해 30번째까지 볼 거라는 얘기다. strand가 -1이면 상보적인 시퀀스를 추출한다는데 저거 단백질이라 의미 없음. 

 

feature_seq2 = feature.extract(at5g40780)
print(feature_seq2)

Feature 변수 주고 extract해도 된다. 

 

레코드 비교하기

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
record1=SeqRecord(Seq("GAATTC"), id="ECoR1",description="sticky")
record2=SeqRecord(Seq("GGCC"), id="HaeIII",description="blunt")
record3=SeqRecord(Seq("CTCGAG"), id="XhoI",description="sticky")
record4=SeqRecord(Seq("GATC"), id="DpnII",description="sticky")

이번 시범조교는 얘네들이다. 

레코드 비교? 그럼 다이렉트로 비교 되나요? 그러면 NotImplementedError: SeqRecord comparison is deliberately not implemented. Explicitly compare the attributes of interest. 에러가 당신을 반겨줄 것이다. 레코드 비교는 안에 든 걸 비교하는거지 레코드를 쌩으로 비교하라는 얘기가 아니다. 

 

print(record1.description==record3.description)
True

EcoRI과 XhoI의 description이 같으므로 참이다.

print(len(record1.seq)==len(record2.seq))
False

EcoRI과 HaeIII의 시퀀스 길이는 다르므로 false. 

 

포맷 만들기

아니 컴퓨터 포맷 아님. 

 

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
record=SeqRecord(
    Seq(
        "ATGGTAGCTCAAGCTCCTCATGATGATCATCAGGATGATGAGAAATTAGCAGCAGCGAGACAAAAAGAGATCGAAGATTGGTTACCAATTACTTCATCAAGAAATGCAAAGTGGTGGTACTCTGCTTTTCACAATGTCACCGCCATGGTCGGTGCCGGAGTTCTTGGACTCCCTTACGCCATGTCTCAGCTCGGATGGGGACCGGGAATTGCAGTGTTGGTTTTGTCATGGGTCATAACACTATACACATTATGGCAAATGGTGGAAATGCATGAAATGGTTCCTGGAAAGCGTTTTGATCGTTACCATGAGCTCGGACAACACGCGTTTGGAGAAAAACTCGGTCTTTATATCGTTGTGCCGCAACAATTGATCGTTGAAATCGGTGTTTGCATCGTTTATATGGTCACTGGAGGCAAATCTTTAAAGAAATTTCATGAGCTTGTTTGTGATGATTGTAAACCAATCAAGCTTACTTATTTCATCATGATCTTTGCTTCGGTTCACTTCGTCCTCTCTCATCTTCCTAATTTCAATTCCATCTCCGGCGTTTCTCTTGCTGCTGCCGTTATGTCTCTCAGCTACTCAACAATCGCATGGGCATCATCAGCAAGCAAAGGTGTTCAAGAAGACGTTCAATACGGTTACAAAGCGAAAACAACAGCCGGTACGGTTTTCAATTTCTTCAGCGGTTTAGGTGATGTGGCATTTGCTTACGCGGGTCATAATGTGGTCCTTGAGATCCAAGCAACTATCCCTTCAACGCCTGAGAAACCCTCAAAAGGTCCCATGTGGAGAGGAGTCATCGTTGCTTACATCGTCGTAGCGCTCTGTTATTTCCCCGTGGCTCTCGTTGGATACTACATTTTCGGGAACGGAGTCGAAGATAATATTCTCATGTCACTTAAGAAACCGGCGTGGTTAATCGCCACGGCGAACATCTTCGTCGTGATCCATGTCATTGGTAGTTACCAGATATATGCAATGCCGGTATTTGATATGATGGAGACTTTATTGGTCAAGAAGCTAAATTTCAGACCAACCACAACTCTTCGGTTCTTTGTCCGTAATTTCTACGTTGCTGCAACTATGTTTGTTGGTATGACGTTTCCGTTCTTCGGTGGGCTTTTGGCGTTCTTTGGTGGATTCGCGTTTGCCCCAACCACATACTTCCTCCCTTGCGTCATTTGGTTAGCCATCTACAAACCCAAGAAATACAGCTTGTCTTGGTGGGCCAACTGGGTATGCATCGTGTTTGGTCTTTTCTTGATGGTCCTATCGCCAATTGGAGGGCTAAGGACAATCGTTATTCAAGCAAAAGGATACAAGTTTTACTCATAA"),
id="at5g40780|LHT1",
description="Lysine-histidine transporter 1|Arabidopsis thaliana")
print(record.format("fasta"))

이런 식으로 입력하면 

 

>at5g40780|LHT1 Lysine-histidine transporter 1|Arabidopsis thaliana
ATGGTAGCTCAAGCTCCTCATGATGATCATCAGGATGATGAGAAATTAGCAGCAGCGAGA
CAAAAAGAGATCGAAGATTGGTTACCAATTACTTCATCAAGAAATGCAAAGTGGTGGTAC
TCTGCTTTTCACAATGTCACCGCCATGGTCGGTGCCGGAGTTCTTGGACTCCCTTACGCC
ATGTCTCAGCTCGGATGGGGACCGGGAATTGCAGTGTTGGTTTTGTCATGGGTCATAACA
CTATACACATTATGGCAAATGGTGGAAATGCATGAAATGGTTCCTGGAAAGCGTTTTGAT
CGTTACCATGAGCTCGGACAACACGCGTTTGGAGAAAAACTCGGTCTTTATATCGTTGTG
CCGCAACAATTGATCGTTGAAATCGGTGTTTGCATCGTTTATATGGTCACTGGAGGCAAA
TCTTTAAAGAAATTTCATGAGCTTGTTTGTGATGATTGTAAACCAATCAAGCTTACTTAT
TTCATCATGATCTTTGCTTCGGTTCACTTCGTCCTCTCTCATCTTCCTAATTTCAATTCC
ATCTCCGGCGTTTCTCTTGCTGCTGCCGTTATGTCTCTCAGCTACTCAACAATCGCATGG
GCATCATCAGCAAGCAAAGGTGTTCAAGAAGACGTTCAATACGGTTACAAAGCGAAAACA
ACAGCCGGTACGGTTTTCAATTTCTTCAGCGGTTTAGGTGATGTGGCATTTGCTTACGCG
GGTCATAATGTGGTCCTTGAGATCCAAGCAACTATCCCTTCAACGCCTGAGAAACCCTCA
AAAGGTCCCATGTGGAGAGGAGTCATCGTTGCTTACATCGTCGTAGCGCTCTGTTATTTC
CCCGTGGCTCTCGTTGGATACTACATTTTCGGGAACGGAGTCGAAGATAATATTCTCATG
TCACTTAAGAAACCGGCGTGGTTAATCGCCACGGCGAACATCTTCGTCGTGATCCATGTC
ATTGGTAGTTACCAGATATATGCAATGCCGGTATTTGATATGATGGAGACTTTATTGGTC
AAGAAGCTAAATTTCAGACCAACCACAACTCTTCGGTTCTTTGTCCGTAATTTCTACGTT
GCTGCAACTATGTTTGTTGGTATGACGTTTCCGTTCTTCGGTGGGCTTTTGGCGTTCTTT
GGTGGATTCGCGTTTGCCCCAACCACATACTTCCTCCCTTGCGTCATTTGGTTAGCCATC
TACAAACCCAAGAAATACAGCTTGTCTTGGTGGGCCAACTGGGTATGCATCGTGTTTGGT
CTTTTCTTGATGGTCCTATCGCCAATTGGAGGGCTAAGGACAATCGTTATTCAAGCAAAA
GGATACAAGTTTTACTCATAA

이렇게 FASTA 포맷이 된다. 

 

슬라이싱

feature도 slicing이 된다. 어째서인지는 모르겠으나... 

 

from Bio import SeqIO
record = SeqIO.read("/home/koreanraichu/sequence.gb", "genbank")
print(record.features[20])

이건 근데 인덱싱 아니냐. 

 

sub_record = record[0:5]
print(sub_record)
ID: NM_000517.6
Name: NM_000517
Description: Homo sapiens hemoglobin subunit alpha 2 (HBA2), mRNA
Number of features: 0
/molecule_type=mRNA
Seq('ACTCT')

아무튼 레코드에서 0부터 5까지 뽑아보았다. 염기 6개가 나온다. 

 

record = SeqIO.read("/home/koreanraichu/sequence.gb", "genbank")
sub_record = record[0:576]
print(sub_record.features[0])

범위를 풀로 바꾸고 0번째 feature를 봤다. 

 

type: source
location: [0:576](+)
qualifiers:
    Key: chromosome, Value: ['16']
    Key: db_xref, Value: ['taxon:9606']
    Key: map, Value: ['16p13.3']
    Key: mol_type, Value: ['mRNA']
    Key: organism, Value: ['Homo sapiens']

그러면 이렇게 나옵니다.

 

print(sub_record.format("genbank"))
print(sub_record.format("fasta"))

위 코드를 이용해 출력 포맷을 바꿀 수도 있다. 

>NM_000517.6 Homo sapiens hemoglobin subunit alpha 2 (HBA2), mRNA
ACTCTTCTGGTCCCCACAGA
# FASTA
LOCUS       NM_000517                 20 bp    mRNA             UNK 01-JAN-1980
DEFINITION  Homo sapiens hemoglobin subunit alpha 2 (HBA2), mRNA.
ACCESSION   NM_000517
VERSION     NM_000517.6
KEYWORDS    .
SOURCE      .
  ORGANISM  .
            .
FEATURES             Location/Qualifiers
ORIGIN
        1 actcttctgg tccccacaga
//
# Genbank

위는 FASTA, 아래는 Genbank 포맷. 

 

'Coding > Python' 카테고리의 다른 글

번외편-코딩테스트 풀이  (0) 2022.08.20
Biopython으로 MSA 해보기  (0) 2022.08.20
Biopython SeqIO 써보기  (0) 2022.08.20
Biopython으로 시퀀스 다뤄보기  (0) 2022.08.20
Biopython으로 시퀀스 가져오기  (0) 2022.08.20
Lv. 35 라이츄

Lv. 35 라이츄

광고 매크로 없는 청정한 블로그를 위해 노력중입니다. 근데 나만 노력하는 것 같음… ㅡㅡ

Biopython으로 시퀀스 다뤄보기

Coding/Python 2022. 8. 20. 02:14

전에요? 아니 그건 가져온거고. 


시퀀스도 텍스트나 리스트처럼 인덱싱과 슬라이싱이 된다. 방법도 똑같다. 그래서 그건 생략할 예정임... 그리고 그거 아니어도 분량이 많아요 또. 

 

.count()

특정 문자열이 몇 개인지 세 준다. 뭐 시퀀스에서 아데닌이나 구아닌 갯수 세주고 그런거다. 이걸 이용하면 특정 단백질 시퀀스에서 특정 아미노산(카테고리면 겁나 세야된다...)이 차지하는 비율도 볼 수 있다. 

 

쉬어가는 코너-Primer에서 GC함량 구하기

아 이거 중요합니다. Primer 만들 때 중요한 요소 중 하나가 GC 함량이다. 참고로 여기서는 두 가지 방법으로 구해볼건데 첫번째는 .count()를 이용해 G와 C의 수를 세서 전체 DNA 염기 수로 나누는 거고, 두번째는 바이오파이썬의 모듈을 이용하는 방법이다. 

GC_cont=(primer.count('C')+primer.count('G'))/len(primer)
print(GC_cont*100,"%")
from Bio.SeqUtils import GC
primer=Seq('CAGCAAGCAAAGGTGTTCAA')
print(GC(primer),"%")

위는 직접 계산하는 방법이고 아래는 모듈을 사용했다. 모듈은 쓰려면 또 모셔와야 한다. 

 

시퀀스에 시퀀스를 더하면 1+1인가요 

시퀀스 그냥 +로 갖다 붙이거나 리스트업된 거 for문으로 붙이거나 join()쓰거나... 일단 for와 join 보고 갑시다. 

seq_list=[Seq('ATCC'),Seq('ATAT'),Seq('TNNN')]
add = Seq("")
for i in seq_list:
    add += i
print(add)
seq_list=[Seq('ATCC'),Seq('ATAT'),Seq('TNNN')]
spacer=Seq('N'*5)
print(spacer.join(seq_list))

 

돌연변이 만들기

시퀀스 데이터는 튜플마냥 안에 있는 내용물을 수정할 수 없다. 수정하려면 또 다른 모듈을 모셔와야 한다. 

from Bio.Seq import Seq
from Bio.Seq import MutableSeq #이걸 불러와서 
my_seq=Seq("GAATTC")
mutable_seq=MutableSeq(my_seq) #적용해줘야 한다
mutable_seq[0]="A"
print(mutable_seq)

Mutableseq에 던져둔 것은 타입도 

<class 'Bio.Seq.MutableSeq'>

로 바뀐다. 일반 시퀀스는 Seq. 

 

전사번역

입력한 시퀀스를 mRNA로 전사도 해 주고, 번역도 된다. 물론 다이렉트로 된다. 

 

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqUtils import GC
at5g40780=Seq('ATGGTAGCTCAAGCTCCTCATGATGATCATCAGGATGATGAGAAATTAGCAGCAGCGAGACAAAAAGAGATCGAAGATTGGTTACCAATTACTTCATCAAGAAATGCAAAGTGGTGGTACTCTGCTTTTCACAATGTCACCGCCATGGTCGGTGCCGGAGTTCTTGGACTCCCTTACGCCATGTCTCAGCTCGGATGGGGACCGGGAATTGCAGTGTTGGTTTTGTCATGGGTCATAACACTATACACATTATGGCAAATGGTGGAAATGCATGAAATGGTTCCTGGAAAGCGTTTTGATCGTTACCATGAGCTCGGACAACACGCGTTTGGAGAAAAACTCGGTCTTTATATCGTTGTGCCGCAACAATTGATCGTTGAAATCGGTGTTTGCATCGTTTATATGGTCACTGGAGGCAAATCTTTAAAGAAATTTCATGAGCTTGTTTGTGATGATTGTAAACCAATCAAGCTTACTTATTTCATCATGATCTTTGCTTCGGTTCACTTCGTCCTCTCTCATCTTCCTAATTTCAATTCCATCTCCGGCGTTTCTCTTGCTGCTGCCGTTATGTCTCTCAGCTACTCAACAATCGCATGGGCATCATCAGCAAGCAAAGGTGTTCAAGAAGACGTTCAATACGGTTACAAAGCGAAAACAACAGCCGGTACGGTTTTCAATTTCTTCAGCGGTTTAGGTGATGTGGCATTTGCTTACGCGGGTCATAATGTGGTCCTTGAGATCCAAGCAACTATCCCTTCAACGCCTGAGAAACCCTCAAAAGGTCCCATGTGGAGAGGAGTCATCGTTGCTTACATCGTCGTAGCGCTCTGTTATTTCCCCGTGGCTCTCGTTGGATACTACATTTTCGGGAACGGAGTCGAAGATAATATTCTCATGTCACTTAAGAAACCGGCGTGGTTAATCGCCACGGCGAACATCTTCGTCGTGATCCATGTCATTGGTAGTTACCAGATATATGCAATGCCGGTATTTGATATGATGGAGACTTTATTGGTCAAGAAGCTAAATTTCAGACCAACCACAACTCTTCGGTTCTTTGTCCGTAATTTCTACGTTGCTGCAACTATGTTTGTTGGTATGACGTTTCCGTTCTTCGGTGGGCTTTTGGCGTTCTTTGGTGGATTCGCGTTTGCCCCAACCACATACTTCCTCCCTTGCGTCATTTGGTTAGCCATCTACAAACCCAAGAAATACAGCTTGTCTTGGTGGGCCAACTGGGTATGCATCGTGTTTGGTCTTTTCTTGATGGTCCTATCGCCAATTGGAGGGCTAAGGACAATCGTTATTCAAGCAAAAGGATACAAGTTTTACTCATAA')
at5g40780_rc=at5g40780.reverse_complement()
# DNA와 상보적인 sequence를 만들었다.

at5g40780_mrna=at5g40780.transcribe()
at5g40780_mrna_rc=at5g40780_mrna.reverse_complement()
# 여기 RNA 추가요!
print(at5g40780_mrna)
AUGGUAGCUCAAGCUCCUCAUGAUGAUCAUCAGGAUGAUGAGAAAUUAGCAGCAGCGAGACAAAAAGAGAUCGAAGAUUGGUUACCAAUUACUUCAUCAAGAAAUGCAAAGUGGUGGUACUCUGCUUUUCACAAUGUCACCGCCAUGGUCGGUGCCGGAGUUCUUGGACUCCCUUACGCCAUGUCUCAGCUCGGAUGGGGACCGGGAAUUGCAGUGUUGGUUUUGUCAUGGGUCAUAACACUAUACACAUUAUGGCAAAUGGUGGAAAUGCAUGAAAUGGUUCCUGGAAAGCGUUUUGAUCGUUACCAUGAGCUCGGACAACACGCGUUUGGAGAAAAACUCGGUCUUUAUAUCGUUGUGCCGCAACAAUUGAUCGUUGAAAUCGGUGUUUGCAUCGUUUAUAUGGUCACUGGAGGCAAAUCUUUAAAGAAAUUUCAUGAGCUUGUUUGUGAUGAUUGUAAACCAAUCAAGCUUACUUAUUUCAUCAUGAUCUUUGCUUCGGUUCACUUCGUCCUCUCUCAUCUUCCUAAUUUCAAUUCCAUCUCCGGCGUUUCUCUUGCUGCUGCCGUUAUGUCUCUCAGCUACUCAACAAUCGCAUGGGCAUCAUCAGCAAGCAAAGGUGUUCAAGAAGACGUUCAAUACGGUUACAAAGCGAAAACAACAGCCGGUACGGUUUUCAAUUUCUUCAGCGGUUUAGGUGAUGUGGCAUUUGCUUACGCGGGUCAUAAUGUGGUCCUUGAGAUCCAAGCAACUAUCCCUUCAACGCCUGAGAAACCCUCAAAAGGUCCCAUGUGGAGAGGAGUCAUCGUUGCUUACAUCGUCGUAGCGCUCUGUUAUUUCCCCGUGGCUCUCGUUGGAUACUACAUUUUCGGGAACGGAGUCGAAGAUAAUAUUCUCAUGUCACUUAAGAAACCGGCGUGGUUAAUCGCCACGGCGAACAUCUUCGUCGUGAUCCAUGUCAUUGGUAGUUACCAGAUAUAUGCAAUGCCGGUAUUUGAUAUGAUGGAGACUUUAUUGGUCAAGAAGCUAAAUUUCAGACCAACCACAACUCUUCGGUUCUUUGUCCGUAAUUUCUACGUUGCUGCAACUAUGUUUGUUGGUAUGACGUUUCCGUUCUUCGGUGGGCUUUUGGCGUUCUUUGGUGGAUUCGCGUUUGCCCCAACCACAUACUUCCUCCCUUGCGUCAUUUGGUUAGCCAUCUACAAACCCAAGAAAUACAGCUUGUCUUGGUGGGCCAACUGGGUAUGCAUCGUGUUUGGUCUUUUCUUGAUGGUCCUAUCGCCAAUUGGAGGGCUAAGGACAAUCGUUAUUCAAGCAAAAGGAUACAAGUUUUACUCAUAA

.transcribe()를 이용하면 전사도 해 준다. 예시에 있는 시퀀스는 at5g40780(의 CDS). 

 

from Bio.Seq import Seq
at5g40780=Seq('ATGGTAGCTCAAGCTCCTCATGATGATCATCAGGATGATGAGAAATTAGCAGCAGCGAGACAAAAAGAGATCGAAGATTGGTTACCAATTACTTCATCAAGAAATGCAAAGTGGTGGTACTCTGCTTTTCACAATGTCACCGCCATGGTCGGTGCCGGAGTTCTTGGACTCCCTTACGCCATGTCTCAGCTCGGATGGGGACCGGGAATTGCAGTGTTGGTTTTGTCATGGGTCATAACACTATACACATTATGGCAAATGGTGGAAATGCATGAAATGGTTCCTGGAAAGCGTTTTGATCGTTACCATGAGCTCGGACAACACGCGTTTGGAGAAAAACTCGGTCTTTATATCGTTGTGCCGCAACAATTGATCGTTGAAATCGGTGTTTGCATCGTTTATATGGTCACTGGAGGCAAATCTTTAAAGAAATTTCATGAGCTTGTTTGTGATGATTGTAAACCAATCAAGCTTACTTATTTCATCATGATCTTTGCTTCGGTTCACTTCGTCCTCTCTCATCTTCCTAATTTCAATTCCATCTCCGGCGTTTCTCTTGCTGCTGCCGTTATGTCTCTCAGCTACTCAACAATCGCATGGGCATCATCAGCAAGCAAAGGTGTTCAAGAAGACGTTCAATACGGTTACAAAGCGAAAACAACAGCCGGTACGGTTTTCAATTTCTTCAGCGGTTTAGGTGATGTGGCATTTGCTTACGCGGGTCATAATGTGGTCCTTGAGATCCAAGCAACTATCCCTTCAACGCCTGAGAAACCCTCAAAAGGTCCCATGTGGAGAGGAGTCATCGTTGCTTACATCGTCGTAGCGCTCTGTTATTTCCCCGTGGCTCTCGTTGGATACTACATTTTCGGGAACGGAGTCGAAGATAATATTCTCATGTCACTTAAGAAACCGGCGTGGTTAATCGCCACGGCGAACATCTTCGTCGTGATCCATGTCATTGGTAGTTACCAGATATATGCAATGCCGGTATTTGATATGATGGAGACTTTATTGGTCAAGAAGCTAAATTTCAGACCAACCACAACTCTTCGGTTCTTTGTCCGTAATTTCTACGTTGCTGCAACTATGTTTGTTGGTATGACGTTTCCGTTCTTCGGTGGGCTTTTGGCGTTCTTTGGTGGATTCGCGTTTGCCCCAACCACATACTTCCTCCCTTGCGTCATTTGGTTAGCCATCTACAAACCCAAGAAATACAGCTTGTCTTGGTGGGCCAACTGGGTATGCATCGTGTTTGGTCTTTTCTTGATGGTCCTATCGCCAATTGGAGGGCTAAGGACAATCGTTATTCAAGCAAAAGGATACAAGTTTTACTCATAA')
# DNA sequence

at5g40780_mrna=at5g40780.transcribe()
# Transcription
print(at5g40780_mrna)
AUGGUAGCUCAAGCUCCUCAUGAUGAUCAUCAGGAUGAUGAGAAAUUAGCAGCAGCGAGACAAAAAGAGAUCGAAGAUUGGUUACCAAUUACUUCAUCAAGAAAUGCAAAGUGGUGGUACUCUGCUUUUCACAAUGUCACCGCCAUGGUCGGUGCCGGAGUUCUUGGACUCCCUUACGCCAUGUCUCAGCUCGGAUGGGGACCGGGAAUUGCAGUGUUGGUUUUGUCAUGGGUCAUAACACUAUACACAUUAUGGCAAAUGGUGGAAAUGCAUGAAAUGGUUCCUGGAAAGCGUUUUGAUCGUUACCAUGAGCUCGGACAACACGCGUUUGGAGAAAAACUCGGUCUUUAUAUCGUUGUGCCGCAACAAUUGAUCGUUGAAAUCGGUGUUUGCAUCGUUUAUAUGGUCACUGGAGGCAAAUCUUUAAAGAAAUUUCAUGAGCUUGUUUGUGAUGAUUGUAAACCAAUCAAGCUUACUUAUUUCAUCAUGAUCUUUGCUUCGGUUCACUUCGUCCUCUCUCAUCUUCCUAAUUUCAAUUCCAUCUCCGGCGUUUCUCUUGCUGCUGCCGUUAUGUCUCUCAGCUACUCAACAAUCGCAUGGGCAUCAUCAGCAAGCAAAGGUGUUCAAGAAGACGUUCAAUACGGUUACAAAGCGAAAACAACAGCCGGUACGGUUUUCAAUUUCUUCAGCGGUUUAGGUGAUGUGGCAUUUGCUUACGCGGGUCAUAAUGUGGUCCUUGAGAUCCAAGCAACUAUCCCUUCAACGCCUGAGAAACCCUCAAAAGGUCCCAUGUGGAGAGGAGUCAUCGUUGCUUACAUCGUCGUAGCGCUCUGUUAUUUCCCCGUGGCUCUCGUUGGAUACUACAUUUUCGGGAACGGAGUCGAAGAUAAUAUUCUCAUGUCACUUAAGAAACCGGCGUGGUUAAUCGCCACGGCGAACAUCUUCGUCGUGAUCCAUGUCAUUGGUAGUUACCAGAUAUAUGCAAUGCCGGUAUUUGAUAUGAUGGAGACUUUAUUGGUCAAGAAGCUAAAUUUCAGACCAACCACAACUCUUCGGUUCUUUGUCCGUAAUUUCUACGUUGCUGCAACUAUGUUUGUUGGUAUGACGUUUCCGUUCUUCGGUGGGCUUUUGGCGUUCUUUGGUGGAUUCGCGUUUGCCCCAACCACAUACUUCCUCCCUUGCGUCAUUUGGUUAGCCAUCUACAAACCCAAGAAAUACAGCUUGUCUUGGUGGGCCAACUGGGUAUGCAUCGUGUUUGGUCUUUUCUUGAUGGUCCUAUCGCCAAUUGGAGGGCUAAGGACAAUCGUUAUUCAAGCAAAAGGAUACAAGUUUUACUCAUAA

.translate()는 번역이다. (*은 종결코돈)

 

전사는 그냥 뭐 전사했구나 하고 넘어가도 되는데, 번역은 옵션이 좀 있으니 일단 보고 가자. 

to_stop=True: 처음 만나는 종결코돈에서 멈춘다

from Bio.Seq import Seq
at5g40780_mrna=Seq('AUGGUAGCUCAAGCUCCUCAUGAUGAUCAUCAGGAUGAUGAGAAAUUAGCAGCAGCGAGACAAAAAGAGAUCGAAGAUUGGUUACCAAUUACUUCAUCAAGAAAUGCAAAGUGGUGGUACUCUGCUUUUCACAAUGUCACCGCCAUGGUCGGUGCCGGAGUUCUUGGACUCCCUUACGCCAUGUCUCAGCUCGGAUGGGGACCGGGAAUUGCAGUGUUGGUUUUGUCAUGGGUCAUAACACUAUACACAUUAUGGCAAAUGGUGGAAAUGCAUGAAAUGGUUCCUGGAAAGCGUUUUGAUCGUUACCAUGAGCUCGGACAACACGCGUUUGGAGAAAAACUCGGUCUUUAUAUCGUUGUGCCGCAACAAUUGAUCGUUGAAAUCGGUGUUUGCAUCGUUUAUAUGGUCACUGGAGGCAAAUCUUUAAAGAAAUUUCAUGAGCUUGUUUGUGAUGAUUGUAAACCAAUCAAGCUUACUUAUUUCAUCAUGAUCUUUGCUUCGGUUCACUUCGUCCUCUCUCAUCUUCCUAAUUUCAAUUCCAUCUCCGGCGUUUCUCUUGCUGCUGCCGUUAUGUCUCUCAGCUACUCAACAAUCGCAUGGGCAUCAUCAGCAAGCAAAGGUGUUCAAGAAGACGUUCAAUACGGUUACAAAGCGAAAACAACAGCCGGUACGGUUUUCAAUUUCUUCAGCGGUUUAGGUGAUGUGGCAUUUGCUUACGCGGGUCAUAAUGUGGUCCUUGAGAUCCAAGCAACUAUCCCUUCAACGCCUGAGAAACCCUCAAAAGGUCCCAUGUGGAGAGGAGUCAUCGUUGCUUACAUCGUCGUAGCGCUCUGUUAUUUCCCCGUGGCUCUCGUUGGAUACUACAUUUUCGGGAACGGAGUCGAAGAUAAUAUUCUCAUGUCACUUAAGAAACCGGCGUGGUUAAUCGCCACGGCGAACAUCUUCGUCGUGAUCCAUGUCAUUGGUAGUUACCAGAUAUAUGCAAUGCCGGUAUUUGAUAUGAUGGAGACUUUAUUGGUCAAGAAGCUAAAUUUCAGACCAACCACAACUCUUCGGUUCUUUGUCCGUAAUUUCUACGUUGCUGCAACUAUGUUUGUUGGUAUGACGUUUCCGUUCUUCGGUGGGCUUUUGGCGUUCUUUGGUGGAUUCGCGUUUGCCCCAACCACAUACUUCCUCCCUUGCGUCAUUUGGUUAGCCAUCUACAAACCCAAGAAAUACAGCUUGUCUUGGUGGGCCAACUGGGUAUGCAUCGUGUUUGGUCUUUUCUUGAUGGUCCUAUCGCCAAUUGGAGGGCUAAGGACAAUCGUUAUUCAAGCAAAAGGAUACAAGUUUUACUCAUAA')
# 이건 mRNA 시퀀스고
at5g40780_protein=at5g40780_mrna.translate(to_stop=True)
#번역했다.
print(at5g40780_protein)
MVAQAPHDDHQDDEKLAAARQKEIEDWLPITSSRNAKWWYSAFHNVTAMVGAGVLGLPYAMSQLGWGPGIAVLVLSWVITLYTLWQMVEMHEMVPGKRFDRYHELGQHAFGEKLGLYIVVPQQLIVEIGVCIVYMVTGGKSLKKFHELVCDDCKPIKLTYFIMIFASVHFVLSHLPNFNSISGVSLAAAVMSLSYSTIAWASSASKGVQEDVQYGYKAKTTAGTVFNFFSGLGDVAFAYAGHNVVLEIQATIPSTPEKPSKGPMWRGVIVAYIVVALCYFPVALVGYYIFGNGVEDNILMSLKKPAWLIATANIFVVIHVIGSYQIYAMPVFDMMETLLVKKLNFRPTTTLRFFVRNFYVAATMFVGMTFPFFGGLLAFFGGFAFAPTTYFLPCVIWLAIYKPKKYSLSWWANWVCIVFGLFLMVLSPIGGLRTIVIQAKGYKFYS


table=2: 아래 사이트에서 제공하는 코돈 테이블로 번역할 수 있다. (기본은 1번 스탠다드)
https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

 

https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi

The Genetic Codes Compiled by Andrzej (Anjay) Elzanowski and Jim Ostell at National Center for Biotechnology Information (NCBI), Bethesda, Maryland, U.S.A. Last update of the Genetic Codes: Jan. 7, 2019 NCBI takes great care to ensure that the translation

www.ncbi.nlm.nih.gov

 

at5g40780_protein=at5g40780_mrna.translate(table=2)
MVAQAPHDDHQDDEKLAAA*QKEIEDWLPITSS*NAKWWYSAFHNVTAMVGAGVLGLPYAMSQLGWGPGIAVLVLSWVMTLYTLWQMVEMHEMVPGKRFDRYHELGQHAFGEKLGLYIVVPQQLIVEIGVCIVYMVTGGKSLKKFHELVCDDCKPIKLTYFIMIFASVHFVLSHLPNFNSISGVSLAAAVMSLSYSTIAWASSASKGVQEDVQYGYKAKTTAGTVFNFFSGLGDVAFAYAGHNVVLEIQATIPSTPEKPSKGPMW*GVIVAYIVVALCYFPVALVGYYIFGNGVEDNILMSLKKPAWLIATANIFVVIHVIGSYQMYAMPVFDMMETLLVKKLNF*PTTTLRFFVRNFYVAATMFVGMTFPFFGGLLAFFGGFAFAPTTYFLPCVIWLAIYKPKKYSLSWWANWVCIVFGLFLMVLSPIGGL*TIVIQAKGYKFYS*

위 CDS를 table 2(Vertebrate mitochondria)로 설정하고 번역한 결과. *은 종결 코돈이다. 

stop_symbol="@": 종결코돈의 기호를 @로 바꾼다. 

at5g40780_protein=at5g40780_mrna.translate(table=2,stop_symbol="-")
MVAQAPHDDHQDDEKLAAA-QKEIEDWLPITSS-NAKWWYSAFHNVTAMVGAGVLGLPYAMSQLGWGPGIAVLVLSWVMTLYTLWQMVEMHEMVPGKRFDRYHELGQHAFGEKLGLYIVVPQQLIVEIGVCIVYMVTGGKSLKKFHELVCDDCKPIKLTYFIMIFASVHFVLSHLPNFNSISGVSLAAAVMSLSYSTIAWASSASKGVQEDVQYGYKAKTTAGTVFNFFSGLGDVAFAYAGHNVVLEIQATIPSTPEKPSKGPMW-GVIVAYIVVALCYFPVALVGYYIFGNGVEDNILMSLKKPAWLIATANIFVVIHVIGSYQMYAMPVFDMMETLLVKKLNF-PTTTLRFFVRNFYVAATMFVGMTFPFFGGLLAFFGGFAFAPTTYFLPCVIWLAIYKPKKYSLSWWANWVCIVFGLFLMVLSPIGGL-TIVIQAKGYKFYS-

위 예시에서는 -로 바꿨다. 

 

글자 다이렉트로 전사번역하기

시퀀스도 글자긴 한데 str과 다른 점이 있다면 상보적인 시퀀스를 만들 수 있다는 것이다. 근데 그렇다고 헐 시퀀스 안만들었다! 망했다! 이건 아니고... 

from Bio.Seq import reverse_complement, transcribe, back_transcribe, translate #질문있는데 이걸 꼭 이렇게 불러야겠니
insulin="GCATTCTGAGGCATTCTCTAACAGGTTCTCGACCCTCCGCCATGGCCCCGTGGATGCATCTCCTCACCGTGCTGGCCCTGCTGGCCCTCTGGGGACCCAACTCTGTTCAGGCCTATTCCAGCCAGCACCTGTGCGGCTCCAACCTAGTGGAGGCACTGTACATGACATGTGGACGGAGTGGCTTCTATAGACCCCACGACCGCCGAGAGCTGGAGGACCTCCAGGTGGAGCAGGCAGAACTGGGTCTGGAGGCAGGCGGCCTGCAGCCTTCGGCCCTGGAGATGATTCTGCAGAAGCGCGGCATTGTGGATCAGTGCTGTAATAACATTTGCACATTTAACCAGCTGCAGAACTACTGCAATGTCCCTTAGACACCTGCCTTGGGCCTGGCCTGCTGCTCTGCCCTGGCAACCAATAAACCCCTTGAATGAG"
#Wrap 알아서 켜주면 안되냐고...
insulin_rc=reverse_complement(insulin)
insulin_mrna=transcribe(insulin)
insulin_prot=translate(insulin)
# 순서대로 reverse complement/전사/번역 
print(insulin_rc)
print(insulin_mrna)
print(insulin_prot)

시퀀스때랑 불러오는 건 다르지만 어쨌든 된다. 

 

코돈 테이블 보기

내 블로그 아미노산 배경화면에도 있긴 한데 그걸 누가 외웁니까... 아무튼 그렇다. 

 

from Bio.Data import CodonTable
standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
print(standard_table)
from Bio.Data import CodonTable
standard_table = CodonTable.unambiguous_dna_by_id[1]
print(standard_table)

둘 다 같은 테이블을 호출하는건데, 위에껀 이름으로 부르고 밑에껀 ID로 부른다. 이름이나 ID는 위에 올린 NCBI 주소로 가면 나오니까 그걸로 부르면 된다. 

 

Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--

아무튼 소환했다. 

 

색인

저거 검색도 된다. 실화다. 

 

print(standard_table.stop_codons)
['TAA', 'TAG', 'TGA']

해당 테이블의 Stop codon을 검색하거나 

 

print(mito_table.start_codons)
['ATT', 'ATC', 'ATA', 'ATG', 'GTG']

개시코돈을 검색하거나(저거 2번 테이블이다)

 

print(standard_table.forward_table['AAA'])
K

특정 코돈이 지정하는 아미노산을 찾을 수 있다. 

Lv. 35 라이츄

Lv. 35 라이츄

광고 매크로 없는 청정한 블로그를 위해 노력중입니다. 근데 나만 노력하는 것 같음… ㅡㅡ

Biopython으로 시퀀스 가져오기

Coding/Python 2022. 8. 20. 02:03

참고로 다른 코딩글은 여기다가 안 올린다. 코드블럭도 없고 카테고리도 없어서 어지간한 코딩 이야기는 노션과 워드프레스에 올리는 중... 아니면 미디움이나. 근데 이건 우째도 올렸네? 아 생물학 카테고리가 있잖아요. 

Biopython은 생물정보학에서 쓰는건데 마침 심심하던 차에 잘됐다 써봐야징 해서 어제 깔았다. 리눅스의 경우 pip install bio(안되면 biopython)으로 걍 터미널에서 깔면 된다(우분투 20.02 LTS 기준). 윈도우마냥 pip 있는 데 안 찾아가도 됨 ㄹㅇ 편함. 근데 파이참 바로가기 없는 건 너무한 거 아니냐고... 내가 터미널 열고 직접 모시러 가야것냐... 

*Notes: Biopython을 사용하려면 사용할 기능에 맞는 모듈을 모셔와야 한다. 코드에 그게 생략되는 경우가 있는데(부분삽입의 경우) from ~ import ~가 생략된 것이다. 


일단 오늘의 시범조교를 모셔보자. 

ATGGTAGCTCAAGCTCCTCATGATGATCATCAGGATGATGAGAAATTAGCAGCAGCGAGACAAAAAGAGATCGAAGATTGGTTACCAATTACTTCATCAAGAAATGCAAAGTGGTGGTACTCTGCTTTTCACAATGTCACCGCCATGGTCGGTGCCGGAGTTCTTGGACTCCCTTACGCCATGTCTCAGCTCGGATGGGGACCGGGAATTGCAGTGTTGGTTTTGTCATGGGTCATAACACTATACACATTATGGCAAATGGTGGAAATGCATGAAATGGTTCCTGGAAAGCGTTTTGATCGTTACCATGAGCTCGGACAACACGCGTTTGGAGAAAAACTCGGTCTTTATATCGTTGTGCCGCAACAATTGATCGTTGAAATCGGTGTTTGCATCGTTTATATGGTCACTGGAGGCAAATCTTTAAAGAAATTTCATGAGCTTGTTTGTGATGATTGTAAACCAATCAAGCTTACTTATTTCATCATGATCTTTGCTTCGGTTCACTTCGTCCTCTCTCATCTTCCTAATTTCAATTCCATCTCCGGCGTTTCTCTTGCTGCTGCCGTTATGTCTCTCAGCTACTCAACAATCGCATGGGCATCATCAGCAAGCAAAGGTGTTCAAGAAGACGTTCAATACGGTTACAAAGCGAAAACAACAGCCGGTACGGTTTTCAATTTCTTCAGCGGTTTAGGTGATGTGGCATTTGCTTACGCGGGTCATAATGTGGTCCTTGAGATCCAAGCAACTATCCCTTCAACGCCTGAGAAACCCTCAAAAGGTCCCATGTGGAGAGGAGTCATCGTTGCTTACATCGTCGTAGCGCTCTGTTATTTCCCCGTGGCTCTCGTTGGATACTACATTTTCGGGAACGGAGTCGAAGATAATATTCTCATGTCACTTAAGAAACCGGCGTGGTTAATCGCCACGGCGAACATCTTCGTCGTGATCCATGTCATTGGTAGTTACCAGATATATGCAATGCCGGTATTTGATATGATGGAGACTTTATTGGTCAAGAAGCTAAATTTCAGACCAACCACAACTCTTCGGTTCTTTGTCCGTAATTTCTACGTTGCTGCAACTATGTTTGTTGGTATGACGTTTCCGTTCTTCGGTGGGCTTTTGGCGTTCTTTGGTGGATTCGCGTTTGCCCCAACCACATACTTCCTCCCTTGCGTCATTTGGTTAGCCATCTACAAACCCAAGAAATACAGCTTGTCTTGGTGGGCCAACTGGGTATGCATCGTGTTTGGTCTTTTCTTGATGGTCCTATCGCCAATTGGAGGGCTAAGGACAATCGTTATTCAAGCAAAAGGATACAAGTTTTACTCATAA

At5g40780의 CDS 시퀀스이다. (코드블럭은 있는데 wrap이 안돼서 걍 생으로 붙였음)

이걸 시퀀스 오브젝트로 생성하는 방법도 간단하다. 그냥 모듈 모셔온 다음 Seq=("")로 감싸버리면 된다. 

at5g40780=Seq("ATGGTAGCTCAAGCTCCTCATGATGATCATCAGGATGATGAGAAATTAGCAGCAGCGAGACAAAAAGAGATCGAAGATTGGTTACCAATTACTTCATCAAGAAATGCAAAGTGGTGGTACTCTGCTTTTCACAATGTCACCGCCATGGTCGGTGCCGGAGTTCTTGGACTCCCTTACGCCATGTCTCAGCTCGGATGGGGACCGGGAATTGCAGTGTTGGTTTTGTCATGGGTCATAACACTATACACATTATGGCAAATGGTGGAAATGCATGAAATGGTTCCTGGAAAGCGTTTTGATCGTTACCATGAGCTCGGACAACACGCGTTTGGAGAAAAACTCGGTCTTTATATCGTTGTGCCGCAACAATTGATCGTTGAAATCGGTGTTTGCATCGTTTATATGGTCACTGGAGGCAAATCTTTAAAGAAATTTCATGAGCTTGTTTGTGATGATTGTAAACCAATCAAGCTTACTTATTTCATCATGATCTTTGCTTCGGTTCACTTCGTCCTCTCTCATCTTCCTAATTTCAATTCCATCTCCGGCGTTTCTCTTGCTGCTGCCGTTATGTCTCTCAGCTACTCAACAATCGCATGGGCATCATCAGCAAGCAAAGGTGTTCAAGAAGACGTTCAATACGGTTACAAAGCGAAAACAACAGCCGGTACGGTTTTCAATTTCTTCAGCGGTTTAGGTGATGTGGCATTTGCTTACGCGGGTCATAATGTGGTCCTTGAGATCCAAGCAACTATCCCTTCAACGCCTGAGAAACCCTCAAAAGGTCCCATGTGGAGAGGAGTCATCGTTGCTTACATCGTCGTAGCGCTCTGTTATTTCCCCGTGGCTCTCGTTGGATACTACATTTTCGGGAACGGAGTCGAAGATAATATTCTCATGTCACTTAAGAAACCGGCGTGGTTAATCGCCACGGCGAACATCTTCGTCGTGATCCATGTCATTGGTAGTTACCAGATATATGCAATGCCGGTATTTGATATGATGGAGACTTTATTGGTCAAGAAGCTAAATTTCAGACCAACCACAACTCTTCGGTTCTTTGTCCGTAATTTCTACGTTGCTGCAACTATGTTTGTTGGTATGACGTTTCCGTTCTTCGGTGGGCTTTTGGCGTTCTTTGGTGGATTCGCGTTTGCCCCAACCACATACTTCCTCCCTTGCGTCATTTGGTTAGCCATCTACAAACCCAAGAAATACAGCTTGTCTTGGTGGGCCAACTGGGTATGCATCGTGTTTGGTCTTTTCTTGATGGTCCTATCGCCAATTGGAGGGCTAAGGACAATCGTTATTCAAGCAAAAGGATACAAGTTTTACTCATAA")
print(at5g40780)

이런 식으로. 

print(type(at5g40780))
<class 'Bio.Seq.Seq'>

누가봐도 시퀀스다. 

엥 근데 저거 걍 텍스트랑 뭔 차이임? Biopython을 통해 시퀀스로 생성한 데이터는 complement sequence와 reverse complement sequence를 만들 수 있다. 즉, 시퀀스로 불러온 CDS에 상보적인 시퀀스를 명령어 한 방에 만들 수 있는 것이다. 텍스트는 저런거 안된다. 

from Bio.Seq import Seq
at5g40780=Seq("ATGGTAGCTCAAGCTCCTCATGATGATCATCAGGATGATGAGAAATTAGCAGCAGCGAGACAAAAAGAGATCGAAGATTGGTTACCAATTACTTCATCAAGAAATGCAAAGTGGT")
print("Sequence",at5g40780)
print("Complement sequence",at5g40780.complement())
print("Reverse complement sequence",at5g40780.reverse_complement())

이렇게 입력하면 

Sequence ATGGTAGCTCAAGCTCCTCATGATGATCATCAGGATGATGAGAAATTAGCAGCAGCGAGACAAAAAGAGATCGAAGATTGGTTACCAATTACTTCATCAAGAAATGCAAAGTGGT
Complement sequence TACCATCGAGTTCGAGGAGTACTACTAGTAGTCCTACTACTCTTTAATCGTCGTCGCTCTGTTTTTCTCTAGCTTCTAACCAATGGTTAATGAAGTAGTTCTTTACGTTTCACCA
Reverse complement sequence ACCACTTTGCATTTCTTGATGAAGTAATTGGTAACCAATCTTCGATCTCTTTTTGTCTCGCTGCTGCTAATTTCTCATCATCCTGATGATCATCATGAGGAGCTTGAGCTACCAT

짜자잔 

complement는 CDS 시퀀스에 그냥 상보적인거고, reverse complement는 그 상보적인 시퀀스 순서를 뒤집은 모양이다. 


아니 근데 DNA가 겁나 바이트가 길어요... 저 시범조교 CDS도 1300자가 넘는데 어느 세월에 그거 복붙할거임... 그래서 FASTA와 Genbank의 데이터를 직접 열어볼 것이다. 

FASTA 시범조교는 PDB ID 2B10, cytochrome c peroxiase이다. 

>2B10_1|Chains A, C|Cytochrome c peroxidase, mitochondrial|Saccharomyces cerevisiae (4932)
TTPLVHVASVEKGRSYEDFQKVYNAIALKLREDDEYDNYIGYGPVLVRLAWHTSGTWDKHDNTGGSYGGTYRFKKEFNDPSNAGLQNGFKFLEPIHKEFPWISSGDLFSLGGVTAVQEMQGPKIPWRCGRVDTPEDTTPDNGRLPDADKDADYVRTFFQRLNMNDREVVALMGAHALGKTHLKNSGYEGPWGAANNVFTNEFYLNLLNEDWKLEKNDANNEQWDSKSGYMMLPTDYSLIQDPKYLSIVKEYANDQDKFFKDFSKAFEKLLENGITFPKDAPSPFIFKTLEEQGL
>2B10_2|Chains B, D|Cytochrome c iso-1|Saccharomyces cerevisiae (4932)
TEFKAGSAKKGATLFKTRCLQCHTVEKGGPHKVGPNLHGIFGRHSGQAEGYSYTDANIKKNVLWDENNMSEYLTNPKKYIPGTKMASGGLKKEKDRNDLITYLKKACE

FASTA 파일 형식은 이렇게 생겼다. 일단 저 파일을 어떻게 가져올거냐... PDB 가서 2B10으로 검색하고 FASTA 파일을 받아보자. 

for seq_record in SeqIO.parse("/home/koreanraichu/rcsb_pdb_2B10.fasta","fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))
2B10_1|Chains
Seq('TTPLVHVASVEKGRSYEDFQKVYNAIALKLREDDEYDNYIGYGPVLVRLAWHTS...QGL')
294
2B10_2|Chains
Seq('TEFKAGSAKKGATLFKTRCLQCHTVEKGGPHKVGPNLHGIFGRHSGQAEGYSYT...ACE')
108

FASTA 파일은 이런 식으로 가져올 수 있다. (seqIO 데려오면 된다) 예? 경로요? 아니 이거 리눅슨디? 

사실 분량때문에 넣지는 못하지만 

print(seq_record)

만 치면 전체 파일을 볼 수 있다. 처음 하시는 분들은 전체를 보고 저 예제를 따라해보는 걸 추천한다. 


for seq_record in SeqIO.parse("/home/koreanraichu/sequence.gb","genbank"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

Genbank에서 뭐 검색하고 그거 gk(gbk)파일로 받으면 그게 genbank 파일이다. 이것도 일단 다운받아서 불러와야 한다. 웹에 있는거요? 그거 바로 못갖고온다. 

아무튼 저 코드를 실행하면 

NM_000517.6
Seq('ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGC...GCA')
576

이렇게 된다. 그리고 얘도 FASTA와 마찬가지로 

print(seq_record)

를 주면 전체 파일을 볼 수는 있는데 분량이 어유...

Lv. 35 라이츄

Lv. 35 라이츄

광고 매크로 없는 청정한 블로그를 위해 노력중입니다. 근데 나만 노력하는 것 같음… ㅡㅡ

방명록