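Implementing Web Search and Data Extraction in Python: Worked Examples
Author: 步子哥
Information is easier to obtain than ever, and with a little programming we can search the web and extract data quickly. This article walks through Python code that interacts with sites such as Google and Wikipedia to help users get the information they need.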
1. Google Search
We first need a search function that can interact with Google. The implementation below does this through the Serper API:
def google_search(query: str) -> str:
    """
    Search Google through the Serper API and return the result as a JSON string.
    """
    import os
    import json
    import requests
    # The API key is read from the environment so it never appears in the source code.
    SERPER_API_KEY = os.environ.get('SERPER_API_KEY', None)
    if SERPER_API_KEY is None:
        raise Exception('Please set SERPER_API_KEY in environment variable first.')
    url = "https://google.serper.dev/search"
    payload = json.dumps({"q": query})
    headers = {
        'X-API-KEY': SERPER_API_KEY,
        'Content-Type': 'application/json'
    }
    # The timeout keeps the call from hanging; raise_for_status surfaces HTTP errors early.
    response = requests.post(url, headers=headers, data=payload, timeout=30)
    response.raise_for_status()
    json_data = response.json()
    # ensure_ascii=True escapes non-ASCII characters as \uXXXX in the returned string.
    return json.dumps(json_data, ensure_ascii=True, indent=4)
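This function uses the requests library to send the HTTP request. It first reads the API key from an environment variable so that it can access the Serper API, then builds the request URL and payload, and finally returns the search results as a formatted JSON string.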
Example
Suppose we want to search for "Chengdu population". We can call the function above:
result = google_search('Chengdu population')
print(result)
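The string returned by google_search is raw JSON, so callers usually want to pull a few fields out of it. The sketch below assumes the response carries an "organic" list whose items have "title", "link" and "snippet" keys; that layout is an assumption about the Serper response, and the helper name extract_organic_results and the sample payload are hypothetical, not part of the original code.

```python
import json
from typing import Dict, List


def extract_organic_results(result_json: str, limit: int = 3) -> List[Dict[str, str]]:
    """Pick title, link and snippet from the JSON string returned by google_search.

    Assumption: results live under an "organic" key; missing keys fall back to "".
    """
    data = json.loads(result_json)
    results = []
    for item in data.get("organic", [])[:limit]:
        results.append({
            "title": item.get("title", ""),
            "link": item.get("link", ""),
            "snippet": item.get("snippet", ""),
        })
    return results


# Hand-written sample standing in for a real API response (not real search data).
sample = json.dumps({
    "organic": [
        {"title": "Example A", "link": "https://example.com/a", "snippet": "First hit."},
        {"title": "Example B", "link": "https://example.com/b"},
    ]
})

for hit in extract_organic_results(sample):
    print(hit["title"], hit["link"])
```

Using .get with defaults means a response without the expected keys yields an empty list rather than a KeyError.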
2. Wikipedia Search
Besides Google search, we can also implement a search function that works with Wikipedia. The implementation is as follows:
def wikipedia_search(query: str) -> str:
    """
    Search English Wikipedia for query and return a short text result (None if nothing was found).
    """
    import requests
    from bs4 import BeautifulSoup
    def get_page_obs(page):
        # Split the page into sentences and keep the first five as a summary.
        paragraphs = page.split("\n")
        paragraphs = [p.strip() for p in paragraphs if p.strip()]
        sentences = []
        for p in paragraphs:
            sentences += p.split('. ')
        # rstrip('.') avoids a doubled period on the last sentence of each paragraph.
        sentences = [s.strip().rstrip('.') + '.' for s in sentences if s.strip()]
        return ' '.join(sentences[:5])
    def clean_str(s):
        # Replace non-breaking spaces and newlines with plain spaces.
        return s.replace("\xa0", " ").replace("\n", " ")
    search_url = "https://en.wikipedia.org/w/index.php"
    # Passing the query through params lets requests URL-encode it.
    response_text = requests.get(search_url, params={"search": query}, timeout=30).text
    soup = BeautifulSoup(response_text, features="html.parser")
    result_divs = soup.find_all("div", {"class": "mw-search-result-heading"})
    if result_divs:
        # We landed on a search-results page: report the closest titles.
        result_titles = [clean_str(div.get_text().strip()) for div in result_divs]
        obs = f"Could not find {query}. Similar: {result_titles[:5]}."
    else:
        page = [p.get_text().strip() for p in soup.find_all("p") + soup.find_all("ul")]
        if any("may refer to:" in p for p in page):
            # Disambiguation page: retry with the query wrapped in brackets.
            obs = wikipedia_search("[" + query + "]")
        else:
            page_content = ""
            for p in page:
                if len(p.split(" ")) > 2:
                    page_content += ' ' + clean_str(p)
                    if not p.endswith("\n"):
                        page_content += "\n"
            obs = get_page_obs(page_content)
    if not obs:
        obs = None
    return obs
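This function uses the BeautifulSoup library to parse the page that Wikipedia returns for the search. If the query lands on a search-results page, it returns the titles of similar entries; if it lands on an article, it returns the first five sentences of that article.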
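The summary step inside wikipedia_search can be tried on its own, without any network access. The sketch below mirrors the get_page_obs helper as a standalone function; the name first_sentences, the n parameter and the sample text are illustrative choices, not part of the original code.

```python
def first_sentences(text: str, n: int = 5) -> str:
    """Return the first n sentences of text, joined by single spaces."""
    sentences = []
    for paragraph in text.split("\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Naive sentence split on ". "; abbreviations such as "e.g. " are cut too.
        for piece in paragraph.split(". "):
            piece = piece.strip().rstrip(".")
            if piece:
                sentences.append(piece + ".")
    return " ".join(sentences[:n])


sample_text = "Python is a language. It is widely used.\n\nIt was created by Guido van Rossum."
print(first_sentences(sample_text, n=2))
# prints: Python is a language. It is widely used.
```

Splitting on ". " is deliberately simple; a production summarizer would likely need a proper sentence tokenizer.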
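Example
To look up information about the Python language, use the following code. Because the function queries English Wikipedia (en.wikipedia.org), the query is written in English:
result = wikipedia_search('Python (programming language)')
print(result)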
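Queries may contain characters such as & or + that have a special meaning in a URL, so they should be percent-encoded rather than pasted into the address as-is. A minimal sketch with the standard library's urllib.parse.urlencode follows; the helper name build_search_url and the constant name are hypothetical.

```python
from urllib.parse import urlencode

WIKIPEDIA_SEARCH_ENDPOINT = "https://en.wikipedia.org/w/index.php"


def build_search_url(query: str) -> str:
    """Return the Wikipedia search URL for query with the query safely encoded."""
    # urlencode turns spaces into '+' and percent-encodes reserved characters.
    return WIKIPEDIA_SEARCH_ENDPOINT + "?" + urlencode({"search": query})


print(build_search_url("Python language"))
# prints: https://en.wikipedia.org/w/index.php?search=Python+language
print(build_search_url("C++ & Java"))
# prints: https://en.wikipedia.org/w/index.php?search=C%2B%2B+%26+Java
```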
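3. Fetching HTML Content with Selenium
Sometimes page content is loaded dynamically by JavaScript, and a plain HTTP request cannot retrieve it. In that case we can drive a real browser with Selenium. The following code implements this: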
def _web_driver_open(url: str, wait_time=10, scroll_to_bottom=False):
    """
    Open a web page in a browser, wait for it to load, and return the Selenium 4 driver.
    """
    import os
    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager
    CHROME_GRID_URL = os.environ.get('CHROME_GRID_URL', None)
    if CHROME_GRID_URL is not None:
        # Use a remote browser when its URL is configured.
        chrome_options = Options()
        driver = webdriver.Remote(command_executor=CHROME_GRID_URL, options=chrome_options)
    else:
        # Otherwise start a local headless Chrome; webdriver_manager installs a driver for it.
        chrome_options = Options()
        chrome_options.add_argument("--headless")  # Ensure GUI is off
        chrome_options.add_argument("--no-sandbox")
        chrome_options.add_argument("--disable-dev-shm-usage")
        chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36")
        webdriver_service = Service(ChromeDriverManager().install())
        driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)
    # With the default page load strategy, get() returns after the page's load event.
    driver.get(url)
    # implicitly_wait sets how long later element lookups may keep polling.
    driver.implicitly_wait(wait_time)
    if scroll_to_bottom:
        # Scroll at most twice, stopping early once the page height stops growing.
        last_height = driver.execute_script("return document.body.scrollHeight")
        for _ in range(2):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(3)
            new_height = driver.execute_script("return document.body.scrollHeight")
            if new_height == last_height:
                break
            last_height = new_height
    return driver
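This function initializes a Selenium WebDriver and opens the given URL. It connects to a remote Chrome instance when the CHROME_GRID_URL environment variable is set, and otherwise starts a local headless Chrome. Optionally, it scrolls the page to load more content.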
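The scrolling logic can be examined separately from the browser. The sketch below restates the loop with the two browser operations passed in as callables, so it can be exercised with a stand-in page; scroll_until_stable, FakePage and their parameters are hypothetical names introduced only for illustration.

```python
import time
from typing import Callable


def scroll_until_stable(get_height: Callable[[], int],
                        scroll_to_bottom: Callable[[], None],
                        max_rounds: int = 2,
                        pause: float = 3.0) -> int:
    """Scroll up to max_rounds times, stopping early once the height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = get_height()
    rounds = 0
    for _ in range(max_rounds):
        scroll_to_bottom()
        rounds += 1
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = get_height()
        if new_height == last_height:
            break
        last_height = new_height
    return rounds


class FakePage:
    """Stand-in for a browser page: grows by 500px per scroll until max_height."""

    def __init__(self, height: int, max_height: int):
        self.height = height
        self.max_height = max_height

    def get_height(self) -> int:
        return self.height

    def scroll(self) -> None:
        self.height = min(self.height + 500, self.max_height)


page = FakePage(height=1000, max_height=1500)
print(scroll_until_stable(page.get_height, page.scroll, max_rounds=5, pause=0))
# prints: 2
```

With a real driver, get_height would wrap driver.execute_script("return document.body.scrollHeight") and scroll_to_bottom would wrap the window.scrollTo call used in _web_driver_open.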
Getting the HTML content
The following function returns the cleaned-up HTML of the page:
def _web_driver_get_html(driver) -> str:
    """
    Return the cleaned HTML (no scripts, styles, or comments) of a Selenium 4 driver whose page is already loaded.
    """
    from bs4 import BeautifulSoup, Comment
    from urllib.parse import urljoin
    url = driver.current_url
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    # Drop scripts and styles, then HTML comments.
    for script_or_style in soup(['script', 'style']):
        script_or_style.decompose()
    # string= replaces the deprecated text= argument.
    for comment in soup(string=lambda text: isinstance(text, Comment)):
        comment.extract()
    # Drop metadata, media, and form elements that carry no readable content.
    for tag in soup(['head', 'meta', 'link', 'title', 'noscript', 'iframe', 'svg', 'canvas', 'audio', 'video', 'embed', 'object', 'param', 'source', 'track', 'map', 'area', 'base', 'basefont', 'bdi', 'bdo', 'br', 'col', 'colgroup', 'datalist', 'details', 'dialog', 'hr', 'img', 'input', 'keygen', 'label', 'legend', 'meter', 'optgroup', 'option', 'output', 'progress', 'select', 'textarea']):
        tag.decompose()
    # Strip attributes from generic containers.
    for tag in soup(['div', 'span']):
        tag.attrs = {}
    # Turn relative links into absolute ones.
    for a in soup.find_all('a', href=True):
        a['href'] = urljoin(url, a['href'])
    # 'img' is in the removal list above, so this loop only matters if 'img' is taken out of that list.
    for img in soup.find_all('img', src=True):
        img['src'] = urljoin(url, img['src'])
    html = str(soup)
    return html
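Example
The two helpers above can be combined to fetch the cleaned HTML of a page. Quit the driver once the HTML has been read:
driver = _web_driver_open('https://example.com')
html_content = _web_driver_get_html(driver)
driver.quit()
print(html_content)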
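The link-rewriting step in _web_driver_get_html relies on urllib.parse.urljoin, which resolves each href against the URL of the page it came from. The short sketch below shows how different kinds of href are resolved; the helper name absolutize and all URLs are made up for illustration.

```python
from typing import List
from urllib.parse import urljoin


def absolutize(page_url: str, hrefs: List[str]) -> List[str]:
    """Resolve every href against the URL of the page it appeared on."""
    return [urljoin(page_url, href) for href in hrefs]


page_url = "https://example.com/docs/guide/index.html"
hrefs = ["intro.html", "../api/", "/static/logo.png", "https://other.example.org/x", "#top"]

for original, resolved in zip(hrefs, absolutize(page_url, hrefs)):
    print(original, "->", resolved)
# intro.html -> https://example.com/docs/guide/intro.html
# ../api/ -> https://example.com/docs/api/
# /static/logo.png -> https://example.com/static/logo.png
# https://other.example.org/x -> https://other.example.org/x
# #top -> https://example.com/docs/guide/index.html#top
```

Links that are already absolute pass through unchanged, which is why the function can apply urljoin to every link without checking it first.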
4. Getting the Text Content of a Page
If only the text content of a page is needed, the following function can be used:
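def web_get_text(url: str, wait_time=10, scroll_to_bottom=True):
    """
    Get the text content of a web page.
    """
    import logging
    driver = None
    try:
        driver = _web_driver_open(url, wait_time, scroll_to_bottom)
        # innerText returns the text as rendered in the browser, without tags.
        text = driver.execute_script("return document.body.innerText")
        return text
    except Exception as e:
        # On failure the error message is returned in place of the page text.
        logging.exception(e)
        return 'Some Error Occurs:\n' + str(e)
    finally:
        # Always close the browser, even when an error occurred.
        if driver is not None:
            driver.quit()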
Example
Call this function to get the text content of a page:
text_content = web_get_text('https://example.com')
print(text_content)
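Text taken from document.body.innerText often contains stray spaces and runs of blank lines, so a small clean-up pass can help before further processing. The sketch below is one possible approach using only the standard library; the function name normalize_text and the sample string are hypothetical.

```python
import re


def normalize_text(text: str) -> str:
    """Trim each line, collapse runs of spaces, and collapse runs of blank lines into one."""
    lines = [re.sub(r"[ \t\xa0]+", " ", line).strip() for line in text.splitlines()]
    cleaned = []
    for line in lines:
        # Skip a blank line when nothing is kept yet or the previous kept line is blank.
        if not line and (not cleaned or not cleaned[-1]):
            continue
        cleaned.append(line)
    # Remove a trailing blank line left over from the end of the page.
    while cleaned and not cleaned[-1]:
        cleaned.pop()
    return "\n".join(cleaned)


raw = "  Title \n\n\n\nFirst   paragraph.\n \nSecond\xa0paragraph.\n\n"
print(normalize_text(raw))
```

In practice it would be applied as normalize_text(web_get_text(url)), after checking that the result is not the error string.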
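Conclusion
The code examples above show how to implement web search and data extraction in Python. These techniques apply broadly to areas such as information gathering and data analysis. As the technology keeps developing, we can expect more efficient ways to obtain and analyze information.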
