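Suppose a text file contains many strings, some starting with http and some not. How can we filter out the URLs?
For example, take a file test.txt with the following contents: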
http://www.sogou.com
this is a url
this is http://www.sogou.com address
The first approach is a containment check:
# encoding: utf-8
with open("test.txt", "r") as f:
    content = f.readlines()

for line in content:
    # Keep any line that contains "http" anywhere
    if "http" in line:
        print(line)
Output:
http://www.sogou.com
this is http://www.sogou.com address
To get only the lines that start with http:
# encoding: utf-8
import re

with open("test.txt", "r") as f:
    content = f.readlines()

for line in content:
    # re.match only succeeds if the pattern matches at the start of the line
    r = re.match("http", line)
    if r is not None:
        print(line)
Output:
http://www.sogou.com
re.match matches from the beginning of the string. It returns a match object if the pattern matches there, and None otherwise.
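To make the "from the beginning" behavior concrete, here is a small sketch that compares re.match with re.search on two of the sample lines. The list `lines` and the names `match_hits` and `search_hits` are only illustrative and are not part of the original script.

```python
import re

# Two of the sample lines, without trailing newlines (illustrative data)
lines = [
    "http://www.sogou.com",
    "this is http://www.sogou.com address",
]

# re.match only tries to match at position 0 of each string
match_hits = [s for s in lines if re.match("http", s) is not None]

# re.search scans the whole string and accepts a match at any position
search_hits = [s for s in lines if re.search("http", s) is not None]

print(match_hits)
print(search_hits)
```

With re.match only the first line is kept, while re.search keeps both, which mirrors the difference between the prefix check and the earlier containment check.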
Is there a simpler way?
# encoding: utf-8
with open("test.txt", "r") as f:
    content = f.readlines()

for line in content:
    # str.startswith checks the prefix directly, no regex needed
    if line.startswith("http"):
        print(line)
The output is the same:
http://www.sogou.com
Since there is a startswith, is there a counterpart that checks the end of a string?
Of course there is.
# encoding: utf-8
with open("test.txt", "r") as f:
    content = f.readlines()

for line in content:
    # Remove the trailing newline before checking the suffix
    if line.replace("\n", "").endswith("com"):
        print(line)
Note that each line read from the file ends with a newline character, so it must be removed before checking the suffix.
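The effect of the trailing newline can be seen in isolation with a short sketch. It uses str.rstrip("\n") as an alternative to replace; the variable names are illustrative only.

```python
# A line as readlines() would return it, with the newline still attached
line = "http://www.sogou.com\n"

# Without stripping, the string ends with "\n", not "com"
raw_check = line.endswith("com")

# rstrip("\n") removes trailing newline characters only,
# leaving the rest of the line untouched
stripped_check = line.rstrip("\n").endswith("com")

print(raw_check, stripped_check)
```

Unlike replace, rstrip only touches the end of the string, so it would not alter a newline that appeared elsewhere in the text.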
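Although the line count is about the same, the method names startswith and endswith make the intent easier to understand.
What if we need to match more than one prefix?
For example, suppose the file contents are: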
http://www.sogou.com
this is a url
this is http://www.sogou.com address
ftp://www.sogou.com
# encoding: utf-8
with open("test.txt", "r") as f:
    content = f.readlines()

for line in content:
    # A tuple of prefixes matches if the line starts with any of them
    if line.startswith(("http", "ftp")):
        print(line)
Just pass a tuple containing the strings to match.
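As a rough sketch of how the tuple form might be wrapped up for reuse, the helper below (the name `lines_with_prefix` is hypothetical, not from the original) converts its argument to a tuple first, because startswith accepts a string or a tuple of strings and raises TypeError for a list.

```python
def lines_with_prefix(lines, prefixes):
    """Return the lines that start with any of the given prefixes."""
    # startswith needs a str or a tuple of str, so convert other iterables
    return [s for s in lines if s.startswith(tuple(prefixes))]


# Sample data mirroring the file contents above (newlines already removed)
sample = [
    "http://www.sogou.com",
    "this is a url",
    "this is http://www.sogou.com address",
    "ftp://www.sogou.com",
]

matched = lines_with_prefix(sample, ["http", "ftp"])
print(matched)

# endswith accepts a tuple in the same way
ends_ok = "ftp://www.sogou.com".endswith(("com", "org"))
print(ends_ok)
```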