
Biopython-dbSNP and ClinVar


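Come on, people, if this works too, somebody could have told me...

For the record, here is how I found out: a company I was interviewing with listed these two as one of its presentation topics, and while I was looking them up it went like this: wait, NCBI made these? -> So they're in Entrez? -> Move over, I've got something to try (fumbles PyCharm open).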


dbSNP

from Bio import Entrez

Entrez.email = "blackholekun@gmail.com"  # NCBI wants to know who is making the request...
# Just put your own email address here.
handle = Entrez.esearch(db="snp", term="EGFR", retmax=40)
record = Entrez.read(handle)
handle.close()
print(record)

Note that the db argument has to be "snp"; writing "dbsnp" does not work.

 

from Bio import Entrez

Entrez.email = "blackholekun@gmail.com"
# dbSNP + esearch: look up the UIDs that match the rs number
handle = Entrez.esearch(db="snp", term="rs1050171", retmax=40)
record = Entrez.read(handle)
handle.close()
print(record['IdList'])
id_list = list(record['IdList'])
# dbSNP + esummary: one request per UID
for snp_id in id_list:
    handle = Entrez.esummary(db="snp", id=snp_id, retmode="xml")
    records = Entrez.read(handle)
    handle.close()
    print(records)
# For now, esummary pulls down everything.

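I adapted my word cloud code to dump everything there is, and the volume is no joke... (rubs face with both hands) On top of that, the Naver code block cut it off again for going over 5,000 characters...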

 

{'DocumentSummarySet': DictElement({'DocumentSummary': [DictElement({'SNP_ID': '1050171', 'ALLELE_ORIGIN': '', 'GLOBAL_MAFS': [{'STUDY': '1000Genomes', 'FREQ': 'A=0.432708/2167'}, {'STUDY': 'ALSPAC', 'FREQ': 'G=0.424235/1635'}, {'STUDY': 'Estonian', 'FREQ': 'G=0.435714/1952'}, {'STUDY': 'ExAC', 'FREQ': 'G=0.477223/57869'}, {'STUDY': 'FINRISK', 'FREQ': 'A=0.486842/148'}, {'STUDY': 'GnomAD', 'FREQ': 'G=0.483832/67753'}, {'STUDY': 'GnomAD_exomes', 'FREQ': 'G=0.476248/119743'}, {'STUDY': 'GoESP', 'FREQ': 'G=0.457558/5951'}, {'STUDY': 'GoNL', 'FREQ': 'G=0.380762/380'}, {'STUDY': 'HapMap', 'FREQ': 'A=0.405488/133'}, {'STUDY': 'KOREAN', 'FREQ': 'A=0.133219/389'}, {'STUDY': 'Korea1K', 'FREQ': 'A=0.126092/231'}, {'STUDY': 'MGP', 'FREQ': 'G=0.440075/235'}, {'STUDY': 'NorthernSweden', 'FREQ': 'G=0.483333/290'}, {'STUDY': 'PRJEB36033', 'FREQ': 'G=0.319149/30'}, {'STUDY': 'PRJEB37584', 'FREQ': 'A=0.153944/121'}, {'STUDY': 'PharmGKB', 'FREQ': 'G=0.344828/40'}, {'STUDY': 'Qatari', 'FREQ': 'G=0.439815/95'}, {'STUDY': 'SGDP_PRJ', 'FREQ': 'G=0.302083/116'}, {'STUDY': 'Siberian', 'FREQ': 'G=0.368421/14'}, {'STUDY': 'TOMMO', 'FREQ': 'A=0.159327/2670'}, {'STUDY': 'TOPMED', 'FREQ': 'G=0.482935/127828'}, {'STUDY': 'TWINSUK', 'FREQ': 'G=0.416667/1545'}, {'STUDY': 'Vietnamese', 'FREQ': 'A=0.285016/175'}, {'STUDY': 'ALFA', 'FREQ': 'G=0.427251/38873'}], 'GLOBAL_POPULATION': '', 'GLOBAL_SAMPLESIZE': '0', 'SUSPECTED': '', 'CLINICAL_SIGNIFICANCE': 'likely-benign,benign', 'GENES': [{'NAME': 'EGFR', 'GENE_ID': '1956'}, {'NAME': 'EGFR-AS1', 'GENE_ID': '100507500'}], 'ACC': 'NC_000007.14', 'CHR': '7', 'HANDLE': 
'GENOMED,GRF,BUSHMAN,PJP,EVA_SAMSUNG_MC,SEATTLESEQ,CSHL,ILLUMINA,WUGSC_SSAHASNP,COMPLETE_GENOMICS,CSHL-HAPMAP,EGP_SNPS,KHV_HUMAN_GENOMES,EVA,EVA_FINRISK,BCMHGSC_JDW,EGCUT_WGS,JMKIDD_LAB,JJLAB,CGAP-GAI,CANCER-GENOME,EVA_UK10K_TWINSUK,ACPOP,TOMMO_GENOMICS,BGI,TISHKOFF,SGDP_PRJ,GNOMAD,SSMP,NHLBI-ESP,1000GENOMES,WEILL_CORNELL_DGM,ILLUMINA-UK,HUMANGENOME_JCVI,EVA-GONL,GMI,SI_EXO,EVA_EXAC,PACBIO,ENSEMBL,HAMMER_LAB,EVA_UK10K_ALSPAC,DDI,TOPMED,USC_VALOUEV,ABI,EVA_MGP,PERLEGEN,BIOINF_KMB_FNS_UNIBA,CLINSEQ_SNP,URBANLAB,KRGDB,PHARMGKB_PAAR-UCHI,HUMAN_LONGEVITY,LMM-PCPGM,LEE,OMUKHERJEE_ADBS,HGSV,CORNELL,SWEGEN,KOGIC,FSA-LAB,CPQ_GEN_INCA,EVA_DECODE', 'SPDI': 'NC_000007.14:55181369:G:A,NC_000007.14:55181369:G:C', 'FXN_CLASS': 'non_coding_transcript_variant,coding_sequence_variant,missense_variant,synonymous_variant,genic_downstream_transcript_variant', 'VALIDATED': 'by-frequency,by-alfa,by-cluster', 'DOCSUM': 'HGVS=NC_000007.14:g.55181370G>A,NC_000007.14:g.55181370G>C,NC_000007.13:g.55249063G>A,NC_000007.13:g.55249063G>C,NG_007726.3:g.167339G>A,NG_007726.3:g.167339G>C,NM_005228.5:c.2361G>A,NM_005228.5:c.2361G>C,NM_005228.4:c.2361G>A,NM_005228.4:c.2361G>C,NM_005228.3:c.2361G>A,NM_005228.3:c.2361G>C,NM_001346899.2:c.2226G>A,NM_001346899.2:c.2226G>C,NM_001346899.1:c.2226G>A,NM_001346899.1:c.2226G>C,NM_001346900.2:c.2202G>A,NM_001346900.2:c.2202G>C,NM_001346900.1:c.2202G>A,NM_001346900.1:c.2202G>C,NM_001346941.2:c.1560G>A,NM_001346941.2:c.1560G>C,NM_001346941.1:c.1560G>A,NM_001346941.1:c.1560G>C,NM_001346898.2:c.2361G>A,NM_001346898.2:c.2361G>C,NM_001346898.1:c.2361G>A,NM_001346898.1:c.2361G>C,NM_001346897.2:c.2226G>A,NM_001346897.2:c.2226G>C,NM_001346897.1:c.2226G>A,NM_001346897.1:c.2226G>C,NR_047551.1:n.1201C>T,NR_047551.1:n.1201C>G,NP_005219.2:p.Gln787His,NP_001333828.1:p.Gln742His,NP_001333829.1:p.Gln734His,NP_001333870.1:p.Gln520His,NP_001333827.1:p.Gln787His,NP_001333826.1:p.Gln742His|SEQ=[G/A/C]|LEN=1|GENE=EGFR:1956,EGFR-AS1:100507500', 'TAX_ID': '9606', 'ORIG_BUILD': '86', 
'UPD_BUILD': '155', 'CREATEDATE': '2000/10/05 19:10', 'UPDATEDATE': '2021/04/26 11:38', 'SS': '1524562,4415417,14121776,16263880,17933589,23461029,24778961,43094183,52067157,65740167,66861655,74802379,77320037,85017386,86269798,93684223,98280473,104430036,105110075,112044216,116097662,142664329,143002862,159714800,159903775,162355178,166625159,203360709,212046487,223092668,233988791,240940108,279318753,285632487,293873711,342235809,479680970,482283259,485456160,490945938,491906494,534578333,560020602,654383955,779491194,781710098,834961318,836317040,974464379,984303175,1067488412,1074630855,1325217930,1431130856,1584052633,1593883812,1618262516,1661256549,1688745702,1711163743,1805015052,1927546725,1966658513,2024460166,2152655954,2294217561,2463237960,2634609883,2708323052,2736449362,2747825596,2853377601,3001152883,3023062922,3026026429,3347597637,3530770690,3530770691,3629823416,3632517137,3636855046,3646352982,3648637018,3669078603,3719737966,3734653562,3766593686,3785824290,3791125584,3796005564,3809753344,3824277165,3825524129,3825539990,3825720179,3830585782,3838783084,3844235500,3867317995,3914396600,3961507699,3984367631,3984367632,3984588810,3985298625,3986039617,3986382885,4746722701,5183240605,5236855147,5236861046,5237033464,5237196188', 'ALLELE': 'V', 'SNP_CLASS': 'snv', 'CHRPOS': '7:55181370', 'CHRPOS_PREV_ASSM': '7:55249063', 'TEXT': 'MergedRs=1050171', 'SNP_ID_SORT': '0001050171', 'CLINICAL_SORT': '1', 'CITED_SORT': '', 'CHRPOS_SORT': '0055181370', 'MERGED_SORT': '1'}, attributes={'uid': '58115520'})], 'DbBuild': 'Build211004-0955.1'}, attributes={'status': 'OK'})}

(That is roughly what a single record looks like.)
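One field in that dump, DOCSUM, packs several values into a single pipe-delimited string (HGVS=...|SEQ=...|LEN=...|GENE=...). Here is a minimal sketch of splitting it apart. The helper name `parse_docsum` is my own, the sample string is an abbreviated copy of the output above, and the assumption that fields are separated by "|" and HGVS expressions by "," is inferred from that one record only.

```python
def parse_docsum(docsum):
    """Split a dbSNP DOCSUM string into a dict of field name -> value.

    Assumes 'KEY=value' fields joined by '|', as seen in the record above.
    """
    fields = {}
    for part in docsum.split("|"):
        # Split on the first '=' only, so any later '=' stays in the value.
        key, _, value = part.partition("=")
        fields[key] = value
    # HGVS holds a comma-separated list of expressions; turn it into a list.
    if "HGVS" in fields:
        fields["HGVS"] = fields["HGVS"].split(",")
    return fields


# Abbreviated from the esummary output for rs1050171 shown above.
sample = (
    "HGVS=NC_000007.14:g.55181370G>A,NM_005228.5:c.2361G>A,"
    "NP_005219.2:p.Gln787His|SEQ=[G/A/C]|LEN=1|"
    "GENE=EGFR:1956,EGFR-AS1:100507500"
)
parsed = parse_docsum(sample)
print(parsed["HGVS"])
print(parsed["SEQ"], parsed["LEN"], parsed["GENE"])
```

With a real record you would pass in `records['DocumentSummarySet']['DocumentSummary'][0]['DOCSUM']` instead of the sample string.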

 

from Bio import Entrez

Entrez.email = "blackholekun@gmail.com"
# dbSNP + esearch
handle = Entrez.esearch(db="snp", term="rs1050171", retmax=40)
record = Entrez.read(handle)
handle.close()
id_list = list(record['IdList'])
# dbSNP + esummary, printing only the first GLOBAL_MAFS entry of each record
for snp_id in id_list:
    handle = Entrez.esummary(db="snp", id=snp_id, retmode="xml")
    records = Entrez.read(handle)
    handle.close()
    print(records['DocumentSummarySet']['DocumentSummary'][0]['GLOBAL_MAFS'][0])
# Got everything with esummary. But it's a dictionary, so why is picking a value out so hard...
# And did it really have to take four levels of indexing? Seriously, that's just too much...

Actually, it's five levels. The dictionaries are nested two and three deep...

/home/koreanraichu/PycharmProjects/pythonProject/venv/bin/python "/home/koreanraichu/PycharmProjects/pythonProject/dbSNP in Entrez.py"
{'STUDY': '1000Genomes', 'FREQ': 'A=0.432708/2167'}
{'STUDY': '1000Genomes', 'FREQ': 'A=0.432708/2167'}
{'STUDY': '1000Genomes', 'FREQ': 'A=0.432708/2167'}
{'STUDY': '1000Genomes', 'FREQ': 'A=0.432708/2167'}
{'STUDY': '1000Genomes', 'FREQ': 'A=0.432708/2167'}
{'STUDY': '1000Genomes', 'FREQ': 'A=0.432708/2167'}

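They all look identical, so did it really fetch the right thing? Yes, it did. The search apparently returns several UIDs that have been merged into rs1050171 (the dump above carries 'TEXT': 'MergedRs=1050171'), so every summary resolves to the same record.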
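Each GLOBAL_MAFS entry printed above is a small dictionary whose FREQ value crams an allele, a frequency, and a number after the slash into one string. Here is a rough sketch of flattening those entries. `parse_freq` is a name I made up, the three sample entries are copied from the dump above, and I am only assuming that the number after the slash is a count.

```python
def parse_freq(entry):
    """Turn {'STUDY': ..., 'FREQ': 'A=0.432708/2167'} into a flat dict."""
    allele, _, rest = entry["FREQ"].partition("=")
    freq, _, count = rest.partition("/")
    return {
        "study": entry["STUDY"],
        "allele": allele,
        "freq": float(freq),
        # Assumption: the value after '/' is an integer count.
        "count": int(count),
    }


# A few entries copied from the esummary output for rs1050171.
global_mafs = [
    {'STUDY': '1000Genomes', 'FREQ': 'A=0.432708/2167'},
    {'STUDY': 'KOREAN', 'FREQ': 'A=0.133219/389'},
    {'STUDY': 'TOPMED', 'FREQ': 'G=0.482935/127828'},
]

rows = [parse_freq(entry) for entry in global_mafs]
for row in rows:
    print(f"{row['study']:<12} {row['allele']} {row['freq']:.4f} ({row['count']})")
```

Note that the reported allele is not the same in every study (A in some, G in others), so the frequencies cannot be compared directly without checking the allele first.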

 

ClinVar

dbSNP is, what's the word... the database that records single nucleotide polymorphisms, that is, variants that come from point mutations, and ClinVar is a public archive that gives you access to the reports about those variants.

 

from Bio import Entrez

Entrez.email = "blackholekun@gmail.com"
# ClinVar + esearch
handle = Entrez.esearch(db="clinvar", term="EGFR", retmax=40)
record = Entrez.read(handle)
handle.close()
print(record)

That is the basic search, and the next step pulls the summary for each hit:

from Bio import Entrez

Entrez.email = "blackholekun@gmail.com"
# ClinVar + esearch
handle = Entrez.esearch(db="clinvar", term="L858R", retmax=40)
record = Entrez.read(handle)
handle.close()
print(record['IdList'])
id_list = list(record['IdList'])
# ClinVar + esummary: one request per UID
for variation_id in id_list:
    handle = Entrez.esummary(db="clinvar", id=variation_id, retmode="xml")
    records = Entrez.read(handle)
    handle.close()
    print(records)
# For now, esummary pulls down everything.

Let's just sweep up everything first... (that for loop is adapted from the one I used when making the word cloud)

 

{'DocumentSummarySet': DictElement({'DocumentSummary': [DictElement({'obj_type': 'single nucleotide variant', 'accession': 'VCV001314051', 'accession_version': 'VCV001314051.', 'title': 'NM_001243133.2(NLRP3):c.2744T>G (p.Leu915Arg)', 'variation_set': [{'measure_id': '1304312', 'variation_xrefs': [], 'variation_name': 'NM_001243133.2(NLRP3):c.2744T>G (p.Leu915Arg)', 'cdna_change': 'c.2744T>G', 'aliases': [], 'variation_loc': [{'status': 'current', 'assembly_name': 'GRCh38', 'chr': '1', 'band': '1q44', 'start': '247444052', 'stop': '247444052', 'inner_start': '', 'inner_stop': '', 'outer_start': '', 'outer_stop': '', 'display_start': '247444052', 'display_stop': '247444052', 'assembly_acc_ver': 'GCF_000001405.38', 'annotation_release': '', 'alt': '', 'ref': ''}, {'status': 'previous', 'assembly_name': 'GRCh37', 'chr': '1', 'band': '1q44', 'start': '247607354', 'stop': '247607354', 'inner_start': '', 'inner_stop': '', 'outer_start': '', 'outer_stop': '', 'display_start': '247607354', 'display_stop': '247607354', 'assembly_acc_ver': 'GCF_000001405.25', 'annotation_release': '', 'alt': '', 'ref': ''}], 'allele_freq_set': [], 'variant_type': 'single nucleotide variant', 'canonical_spdi': 'NC_000001.11:247444051:T:G'}], 'trait_set': [{'trait_xrefs': [{'db_source': 'MedGen', 'db_id': 'CN517202'}], 'trait_name': 'not provided'}], 'supporting_submissions': {'scv': ['SCV002002472'], 'rcv': ['RCV001771282']}, 'clinical_significance': {'description': 'Uncertain significance', 'last_evaluated': '2021/01/14 00:00', 'review_status': 'criteria provided, single submitter'}, 'record_status': '', 'gene_sort': 'NLRP3', 'chr_sort': '01', 'location_sort': '00000000000247444052', 'variation_set_name': '', 'variation_set_id': '', 'genes': [{'symbol': 'NLRP3', 'GeneID': '114548', 'strand': '+', 'source': 'submitted'}], 'protein_change': 'L801R, L858R, L915R, L917R', 'fda_recognized_database': ''}, attributes={'uid': '1314051'})], 'DbBuild': 'Build211201-1855.1'}, attributes={'status': 
'OK'})}

(rubs face with both hands)


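At least this one didn't get cut off... but I was swearing just the same about how many layers deep these dictionaries go...

In the end I worked through it step by step, with the mindset from my game-coding days, marking up the dictionary as I went. (For nested dictionaries, index one level at a time, as in ['outer key']['inner key'].)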
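To make that level-by-level lookup a little less painful, here is a small sketch of a helper that follows a whole path of keys and list indexes in one call. `dig` is a hypothetical helper of mine, not part of Biopython, and the `records` dictionary is a cut-down stand-in built from the ClinVar dump above rather than a live Entrez.read() result.

```python
from functools import reduce


def dig(data, *path):
    """Follow a sequence of keys and list indexes into nested containers."""
    return reduce(lambda node, key: node[key], path, data)


# Cut-down stand-in for the ClinVar esummary result shown above.
records = {
    "DocumentSummarySet": {
        "DocumentSummary": [
            {
                "title": "NM_001243133.2(NLRP3):c.2744T>G (p.Leu915Arg)",
                "clinical_significance": {
                    "description": "Uncertain significance",
                },
                "genes": [{"symbol": "NLRP3", "GeneID": "114548"}],
            }
        ]
    }
}

summary_path = ("DocumentSummarySet", "DocumentSummary", 0)
print(dig(records, *summary_path, "title"))
print(dig(records, *summary_path, "clinical_significance", "description"))
print(dig(records, *summary_path, "genes", 0, "symbol"))
```

This is the same thing as chaining brackets; it just keeps the path in one place so you can reuse the common prefix. A wrong key still raises KeyError or IndexError, exactly as plain indexing would.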

 

from Bio import Entrez

Entrez.email = "blackholekun@gmail.com"
# ClinVar + esearch
handle = Entrez.esearch(db="clinvar", term="L858R", retmax=40)
record = Entrez.read(handle)
handle.close()
id_list = list(record['IdList'])
# ClinVar + esummary, printing only the title of each record
for variation_id in id_list:
    handle = Entrez.esummary(db="clinvar", id=variation_id, retmode="xml")
    records = Entrez.read(handle)
    handle.close()
    print(records['DocumentSummarySet']['DocumentSummary'][0]['title'])
# For now, esummary pulls down everything.
# Wow, though... this honestly leaves me speechless.

NM_001243133.2(NLRP3):c.2744T>G (p.Leu915Arg)
NM_000245.4(MET):c.2573T>G (p.Leu858Arg)
NM_000222.3(KIT):c.2585T>G (p.Leu862Arg)
NM_005228.5(EGFR):c.2573_2574delinsGT (p.Leu858Arg)
NM_005228.5(EGFR):c.2572_2573inv (p.Leu858Arg)
NM_005228.5(EGFR):c.2369C>T (p.Thr790Met)
NM_005228.5(EGFR):c.2573T>G (p.Leu858Arg)

Among the EGFR entries you can see Leu858Arg, which is L858R, and Thr790Met, which is T790M. (Arginine is R because A is already taken by alanine.)
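Going from the three-letter form in a title to the short form can be automated. Below is a minimal sketch: the lookup table is the standard one-letter amino acid code, `short_protein_change` is a name I picked, and the regular expression only handles simple substitutions like p.Leu858Arg (anything else returns None).

```python
import re

# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def short_protein_change(title):
    """Extract e.g. 'p.Leu858Arg' from a ClinVar title and return 'L858R'.

    Returns None when the title has no simple substitution in p. notation.
    """
    match = re.search(r"p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})", title)
    if match is None:
        return None
    ref, pos, alt = match.groups()
    if ref not in THREE_TO_ONE or alt not in THREE_TO_ONE:
        return None
    return f"{THREE_TO_ONE[ref]}{pos}{THREE_TO_ONE[alt]}"


# Titles copied from the output above.
titles = [
    "NM_005228.5(EGFR):c.2573T>G (p.Leu858Arg)",
    "NM_005228.5(EGFR):c.2369C>T (p.Thr790Met)",
    "NM_000222.3(KIT):c.2585T>G (p.Leu862Arg)",
]
for title in titles:
    print(short_protein_change(title))
```

Run against the titles above, this prints L858R, T790M, and L862R.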
