barcode

Adding a Cutter feature: introducing regular expressions


Bringing this to Finder will take a while... that one outright needs regex plus find-and-replace.


Introducing regular expressions

import pandas as pd
import re
from datetime import datetime
enzyme_table = pd.read_csv('/home/koreanraichu/restriction.csv')
enzyme_table2 = pd.read_csv('/home/koreanraichu/restriction_RE.csv')
# To bring in regex I had no choice but to merge the two tables...
enzyme_table = pd.concat([enzyme_table,enzyme_table2])
enzyme_table = enzyme_table.sort_values('Enzyme')
enzyme_table.reset_index(inplace=True)
# Merged...
print(enzyme_table)

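There is a lot going on here, but the CSV file with RE in its name holds the enzymes that need regex handling: the ones whose recognition sequence contains letters such as N, S, or B. A plain find cannot locate those without regex, so I had left them out until now.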
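To see why a plain find falls short, here is a minimal sketch. The test sequence and the recognition site below are made-up illustration values, not rows taken from the CSV files.

```python
import re

sequence = "AAGCTGCAA"   # hypothetical test sequence
site = "GCNGC"           # hypothetical recognition site containing the wildcard N

# str.find searches for the literal letter "N", which does not occur in the sequence.
literal_hit = sequence.find(site)

# Once N is rewritten as ".", the regex engine accepts any base in that position.
regex_hit = re.search(site.replace("N", "."), sequence)

print(literal_hit)        # -1
print(regex_hit.start())  # 2
```

The literal search reports no site at all, while the converted pattern finds GCTGC starting at index 2.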

 

class RE_treatment:
    def RE_wildcard(self,before_seq):
        self.before_seq = before_seq
        before_seq = before_seq.replace("N",".")
        return before_seq
    # Wildcard: if the sequence data contains N, turn it into a regex wildcard.
    def RE_or(self,before_seq):
        self.before_seq = before_seq
        if "B" in before_seq:
            before_seq = before_seq.replace("B","[CGT]")
        elif "D" in before_seq:
            before_seq = before_seq.replace("D","[AGT]")
        elif "H" in before_seq:
            before_seq = before_seq.replace("H","[ACT]")
        elif "K" in before_seq:
            before_seq = before_seq.replace("K","[GT]")
        elif "M" in before_seq:
            before_seq = before_seq.replace("M","[AC]")
        elif "R" in before_seq:
            before_seq = before_seq.replace("R","[AG]")
        elif "S" in before_seq:
            before_seq = before_seq.replace("S","[CG]")
        elif "V" in before_seq:
            before_seq = before_seq.replace("V","[ACG]")
        elif "W" in before_seq:
            before_seq = before_seq.replace("W","[AT]")
        elif "Y" in before_seq:
            before_seq = before_seq.replace("Y","[CT]")
        return before_seq
    # Or: if the sequence data contains a letter other than N and A/T/G/C, replace it with the matching regex character class.

The class. Think of it as a cookie cutter: it holds the code that turns letters such as N, D, and B into their regex equivalents.
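As an alternative to the if/elif chain, the same mapping could be kept in a lookup table and applied in a single pass. This is only a sketch, not the code used in this post: the names `IUPAC_TO_REGEX` and `iupac_to_regex` are my own placeholders, and the table simply restates the replacements from the class above.

```python
import re

# Ambiguity letters mapped to regex fragments (same replacements as RE_treatment).
IUPAC_TO_REGEX = {
    "N": ".",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "K": "[GT]", "M": "[AC]", "R": "[AG]",
    "S": "[CG]", "W": "[AT]", "Y": "[CT]",
}

def iupac_to_regex(site):
    """Translate a recognition site into a regex pattern, one letter at a time."""
    # Letters without an entry (A, C, G, T) are passed through unchanged.
    return "".join(IUPAC_TO_REGEX.get(base, base) for base in site)

pattern = iupac_to_regex("GCDGHC")
print(pattern)                                # GC[AGT]G[ACT]C
print(bool(re.fullmatch(pattern, "GCAGTC")))  # True
```

Because every letter is looked up independently, a site with several different ambiguity letters is converted without any repeated passes.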

 

def convert (a):
    RE = RE_treatment()
    res_find_after = a
    # Keep converting until no ambiguity letter is left in the pattern.
    while True:
        if "N" in res_find_after:
            res_find_after = RE.RE_wildcard(res_find_after)
        elif "B" in res_find_after or "D" in res_find_after or "H" in res_find_after or "K" in res_find_after or "M" in res_find_after or "R" in res_find_after or "S" in res_find_after or "V" in res_find_after or "W" in res_find_after or "Y" in res_find_after:
            res_find_after = RE.RE_or(res_find_after)
        else:
            break
    return res_find_after

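An if statement with the conditions crammed in by brute force... Without that while True, a sequence with several different letters, such as GCDGHC, does not get fully converted. (N is simply turned into a wildcard everywhere in one go.)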

 

if "N" in res_find or "B" in res_find or "D" in res_find or "H" in res_find or "K" in res_find or "M" in res_find or "R" in res_find or "S" in res_find or "V" in res_find or "W" in res_find or "Y" in res_find:
    res_find_after = str(convert(res_find))
else:
    res_find_after = res_find
# Regex conversion (sequences with two or more ambiguity letters need handling too)
Findall = re.findall(res_find_after,sequence)
if Findall:
    count += 1
    site_count = len(Findall)
    # Sort the enzyme into a list by how many times it cuts.
    if site_count == 1:
        once_cut_list.append(enzyme)
    elif site_count == 2:
        two_cut_list.append(enzyme)
    else:
        multi_cut_list.append(enzyme)
    res_loc_list = ', '.join(res_loc_list)
    f.write("{0}: {1} {2},{3} times cut. Where(bp): {4} \n".format(enzyme,res_find,feature,site_count,res_loc_list))

(So, roughly, this is the code in question.)
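The excerpt above depends on variables from the surrounding loop, so here is a self-contained sketch of just the counting and sorting step. The function name `classify_enzymes`, the enzyme names, and the sequence are all hypothetical, and the patterns are assumed to be already converted to regex.

```python
import re

def classify_enzymes(sites, sequence):
    """Group enzyme names by how many times their pattern matches the sequence."""
    groups = {"once": [], "twice": [], "multi": []}
    for enzyme, pattern in sites.items():
        site_count = len(re.findall(pattern, sequence))
        if site_count == 0:
            continue  # this enzyme does not cut the sequence at all
        if site_count == 1:
            groups["once"].append(enzyme)
        elif site_count == 2:
            groups["twice"].append(enzyme)
        else:
            groups["multi"].append(enzyme)
    return groups

# Hypothetical enzymes with already-converted recognition patterns.
sites = {"EnzA": "GAATTC", "EnzB": "GC.GC", "EnzC": "CCCGGG"}
sequence = "TTGAATTCAAGCTGCAAGCAGCTT"

print(classify_enzymes(sites, sequence))
# {'once': ['EnzA'], 'twice': ['EnzB'], 'multi': []}
```

Note that re.findall counts non-overlapping matches only, so overlapping sites would be undercounted with this approach.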

 

Reporting where each enzyme cuts

This one is easy.

 

def cut_func (a,b):
    global res_loc_list
    locs = re.finditer(a,b)
    for i in locs:
        loc = i.start()
        res_loc_list.append(str(loc))
    return res_loc_list
# This is the function that handles cut positions.

Find every match with finditer(), then collect each match's start() (its starting position) into a list.
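For reference, the same idea can be written without the global list. This is a sketch with hypothetical function names and a made-up sequence. It also shows one thing worth knowing: finditer skips overlapping matches, and wrapping the pattern in a zero-width lookahead is one way to report them as well.

```python
import re

def find_site_positions(pattern, sequence):
    """Return the 0-based start index of every non-overlapping match."""
    return [match.start() for match in re.finditer(pattern, sequence)]

def find_overlapping_positions(pattern, sequence):
    """Same idea, but a zero-width lookahead also reports overlapping sites."""
    return [match.start() for match in re.finditer(f"(?={pattern})", sequence)]

sequence = "GCAGCAGC"  # hypothetical sequence with two overlapping GC.GC sites

print(find_site_positions("GC.GC", sequence))         # [0]
print(find_overlapping_positions("GC.GC", sequence))  # [0, 3]
```

The indices are 0-based, so if positions should be reported in bp counted from 1, adding 1 to each value would be needed.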

Result

Now, once this is brought into Finder and FASTA support is added, most of what I need will be covered.