Mastering Web Scraping with ChatGPT: A Comprehensive Guide
Understanding Web Scraping with ChatGPT
In a prior discussion, I showcased a method for web scraping using straightforward prompts for ChatGPT, such as "scrape website X using Python." However, this technique is not always effective.
After extensive attempts to scrape numerous sites with ChatGPT, I've determined that simple prompts like the aforementioned rarely yield results. Instead, I've identified a more effective strategy that enables us to scrape virtually any website using ChatGPT alongside some basic HTML knowledge.
The Importance of Using ChatGPT Playground
To efficiently scrape websites with ChatGPT, it's crucial to utilize the advanced version known as Playground. This version offers more flexibility and significantly faster code generation.
As illustrated below, Playground differs from the standard ChatGPT interface by providing enhanced customization options, particularly beneficial for code generation. This eliminates the common issues of restrictions and delayed responses.
To get started, we will input our prompts in the designated area beneath the "Playground" header.
Scraping Any Website Using ChatGPT
To demonstrate how to use ChatGPT for web scraping, we'll begin with a straightforward site called Subslikescript, which lists various movies. Later, I will guide you through the process of scraping content from platforms like Amazon and Twitter.
Let's say we aim to extract the movie titles displayed in the image above. First, we must inspect the website's elements by right-clicking and selecting "Inspect." This action will reveal the necessary HTML structure.
Next, we identify the elements containing the data: the titles sit inside a tags within a ul element that carries the class scripts-list. Here’s the prompt we’ll use:
The following code will be generated:
import requests
from bs4 import BeautifulSoup

# Fetch the movies listing page (adjust the URL to the page you inspected)
url = "https://subslikescript.com/movies"
page = requests.get(url)

soup = BeautifulSoup(page.content, 'html.parser')

# Find the container with the class "scripts-list"
scripts_list = soup.find(class_="scripts-list")

# Each movie title is inside an <a> tag within that container
all_a_elements = scripts_list.find_all('a')
for element in all_a_elements:
    print(element.get_text())
This successfully extracts all the movie titles! Now that we grasp the basics, let's move on to scraping Amazon.
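To sanity-check the parsing logic without hitting the live site, you can run the same BeautifulSoup calls on a small inline HTML snippet. The snippet below is illustrative, not the site's actual markup; it simply mirrors the ul / scripts-list / a structure we inspected:

```python
from bs4 import BeautifulSoup

# Illustrative HTML mirroring the structure we inspected:
# a <ul> with class "scripts-list" whose <a> tags hold the titles
html = """
<ul class="scripts-list">
  <li><a href="/movie-1">The Matrix</a></li>
  <li><a href="/movie-2">Inception</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
scripts_list = soup.find(class_="scripts-list")
titles = [a.get_text() for a in scripts_list.find_all("a")]
print(titles)  # ['The Matrix', 'Inception']
```

If this prints the two sample titles, the same `find` / `find_all` logic will work on the real page, provided the class name matches what you saw in the inspector.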
Scraping Amazon with ChatGPT
Suppose you're interested in extracting self-help book titles from Amazon. Start by searching for "self-help books" on the Amazon website and copy the resulting link. The exact link may vary based on your location, so feel free to use the same one I did for consistency.
Next, we'll inspect the elements containing the book titles. To extract this information, we will use the span tag along with the class attributes a-size-base-plus a-color-base a-text-normal.
In this instance, I'll use Selenium to automate the scraping. The prompt will tell ChatGPT to wait 5 seconds for the page to load and to build an XPath from the tag and class we identified.
Here’s the prompt for Amazon:
The generated code will look like this:
from selenium import webdriver
from selenium.webdriver.common.by import By
from time import sleep

# Initialize the webdriver (Selenium 4+ manages the driver binary automatically)
driver = webdriver.Chrome()

# Navigate to the website (use the search-results link you copied)
driver.get("https://www.amazon.com/s?k=self-help+books")

# Wait 5 seconds to let the page load
sleep(5)

# Locate all the elements with the specified XPath
elements = driver.find_elements(By.XPATH, '//span[@class="a-size-base-plus a-color-base a-text-normal"]')

# Get the text of each element and print it
for element in elements:
    print(element.text)

# Close the webdriver
driver.quit()
This code effectively extracts all the book titles from Amazon!
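The XPath in the generated code follows a simple, reusable pattern: //tag[@class="…"]. If you want to target other tags and classes you find in the inspector, a tiny helper (hypothetical, not part of the generated code) can build such expressions:

```python
def build_xpath(tag: str, class_attr: str) -> str:
    """Build an XPath matching `tag` elements whose class attribute is exactly `class_attr`."""
    return f'//{tag}[@class="{class_attr}"]'

# The expression used for the Amazon book titles above:
xpath = build_xpath("span", "a-size-base-plus a-color-base a-text-normal")
print(xpath)  # //span[@class="a-size-base-plus a-color-base a-text-normal"]
```

Note that this matches the class attribute exactly; Amazon's class strings must be copied verbatim from the inspector, including their order.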
Scraping Twitter with ChatGPT
Now, let’s say you wish to scrape tweets related to "ChatGPT." Start by searching for "ChatGPT" on Twitter and copying the resulting link.
We will again utilize Selenium to inspect the tweet elements. To extract tweets, we need to use the div tag along with the lang attribute.
Here’s the prompt for Twitter:
The generated code will be:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Initialize the webdriver (Selenium 4+ locates the driver binary automatically)
driver = webdriver.Chrome()
driver.maximize_window()

# Navigate to the search results (use the link you copied)
driver.get("https://twitter.com/search?q=ChatGPT")

# Wait for the page and tweets to load
time.sleep(15)

# Each tweet's text lives in a div that carries a lang attribute
elements = driver.find_elements(By.XPATH, "//div[@lang]")
for element in elements:
    print(element.text)

driver.quit()
If you run this, you will retrieve the first few tweets from the search results. To collect more tweets, just add "scroll down X times" to your instructions.
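If you'd rather add the scrolling yourself instead of re-prompting, the logic is a short loop. The helper below is a sketch (the function name is my own, not from the generated code); it works with any Selenium driver, so the same approach applies to Amazon or Twitter:

```python
import time

def scroll_down(driver, times: int, pause: float = 2.0) -> None:
    """Scroll to the bottom of the page `times` times, pausing between
    scrolls so newly loaded content has a chance to render."""
    for _ in range(times):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)

# Usage with a live Selenium driver, after driver.get(...):
# scroll_down(driver, times=5)
```

Each scroll triggers Twitter's infinite-loading, so running `find_elements` again after the loop returns the larger set of tweets.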
Congratulations! You've learned how to scrape websites without writing code, simply by letting ChatGPT handle the heavy lifting.
Transforming Websites into Datasets
Join my email list with over 20k subscribers to receive my FREE Web Scraping Cheat Sheet!
[Embedded video] A demonstration of using GPT-3.5 for web scraping, with practical insights and examples.
[Embedded video] A beginner-friendly introduction to web scraping with ChatGPT, ideal for those new to the topic.