python - Algorithm for searching text in a corrupted file
I have to search for tags in text files that are damaged: the data in the files has changed (some characters have been deleted, others have been modified). For example, I have to search for the tag -> "no of pages"
text file 1:
bhaskar rao mukku (57)abstract in system 2 pedal rods pedals, 1 side balls based axle, hollowed secondary axle, counter axle, 2 splined gear wheels has 2 clutch pin holes on circular pitch, 2 splined gear wheels has ratchet gears on circular pitch, sprocket wheel, 4 clutch pins , liver used convert ordinary bicycle gear bicycle. number page : 10
text file 2:
bhaskar rao mukku (57)abstract in system 2 pedal rods pedals, 1 side balls based axle, hollowed secondary axle, counter axle, 2 splined gear wheels has 2 clutch pin holes on circular pitch, 2 splined gear wheels has ratchet gears on circular pitch, sprocket wheel, 4 clutch pins , liver used convert ordinary bicycle gear bicycle. no. of pages: 10
text file 3:
bhaskar rao mukku (57)abstract in system 2 pedal rods pedals, 1 side balls based axle, hollowed secondary axle, counter axle, 2 splined gear wheels has 2 clutch pin holes on circular pitch, 2 splined gear wheels has ratchet gears on circular pitch, sprocket wheel, 4 clutch pins , liver used convert ordinary bicycle gear bicycle. no of pages: 10
The above are samples of the text files. As you can see, the tag appears in 3 different forms in these files ("number page", "no. of pages", "no of pages"). For these 3 files, the code must output the corresponding words.
What I have tried so far is to find the longest common subsequence between the tag and a contiguous string from the text file (of length roughly equal to that of the tag), and then calculate the percentage of characters matched. If the percentage is > 85, the code outputs that contiguous string.
My code:

    def lcs(s, t):
        # Length of the longest common subsequence of s and t.
        m = len(s)
        n = len(t)
        counter = [[0] * (n + 1) for x in range(m + 1)]
        for i in range(m):
            for j in range(n):
                if s[i] == t[j]:
                    counter[i + 1][j + 1] = counter[i][j] + 1
                else:
                    counter[i + 1][j + 1] = max(counter[i + 1][j], counter[i][j + 1])
        return counter[m][n]

    def match(word, tag):
        # modify() is a helper that is not shown in this post.
        word = modify(word)
        tag = modify(tag)
        sq = lcs(word, tag)
        return float(sq) / float(max(len(word), len(tag)))

    i = 0
    start = end = 0  # records the position of the matched tag in string
    p = 0.85         # percentage threshold
    while i < len(string):  # string contains the text file
        j = i
        while j < i + len(tag) + 7:  # tag is the tag I want to search for
            arr = match(string[i:j + 1], tag)
            # print(str(p) + " " + str(arr) + ' ' + string[i:j + 1] + ' ' + str(i))
            if arr > p:
                p = arr
                start = i
                end = j
            elif p == arr:
                if end - start >= j - i:
                    start = i
                    end = j
            j += 1
        i += 1
But the code fails in many cases, such as text file 1. Is there any other way of searching more accurately and efficiently?
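To make the failure on text file 1 concrete, here is a minimal sketch that scores the three variants against the tag using the same LCS ratio. It assumes that modify() leaves the strings unchanged (modify() is not shown in this post), and the names lcs_length and lcs_ratio are hypothetical, chosen only for this illustration.

```python
def lcs_length(s, t):
    """Length of the longest common subsequence of s and t."""
    m, n = len(s), len(t)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if s[i] == t[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i + 1][j], table[i][j + 1])
    return table[m][n]


def lcs_ratio(candidate, tag):
    """LCS length divided by the longer of the two lengths."""
    return lcs_length(candidate, tag) / max(len(candidate), len(tag))


# Assumption: no normalization is applied (modify() treated as identity).
tag = "no of pages"
variants = ["number page", "no. of pages", "no of pages"]
scores = {v: lcs_ratio(v, tag) for v in variants}
for v, score in scores.items():
    print(f"{v!r}: {score:.3f}")
```

Under that assumption, "number page" shares only the subsequence "n page" with the tag, so it scores 6/11 (about 0.55), far below the 0.85 threshold, while "no. of pages" scores 11/12 (about 0.92) and "no of pages" scores 1.0. This is why a fixed character-level threshold accepts files 2 and 3 but rejects file 1.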