Web crawling with Python Selenium
🕒 2025-04-27 10:13:59.206333
Selenium is one of the most popular frameworks for browser automation.
What will you learn?
- What is web crawling?
- Crawl the Facebook website and perform some actions on it.
- Walk through each part of the code.
What is web crawling?
Web crawling is the automated traversal of web pages by a script. Search engines rely on crawlers, also known as spiders or search engine bots, to discover and index content. In this post, I use the term more loosely for automating tasks on a website with a script.
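To make the idea concrete, here is a minimal sketch of what a crawler does: fetch a page, extract its links, and visit each new link once. It runs offline against a tiny in-memory "web", so FAKE_WEB, LinkExtractor, and crawl are names I made up for illustration, not part of Selenium or of the script in this post.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical in-memory "web" (URL -> HTML) so the sketch needs no network.
FAKE_WEB = {
    "/home": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "/about": '<a href="/home">Home</a>',
    "/blog": '<a href="/about">About</a> <a href="/missing">Broken</a>',
}


def crawl(start, pages):
    """Breadth-first crawl; returns the URLs in the order they were visited."""
    visited = []
    seen = {start}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:  # the page does not exist, so skip it
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return visited


print(crawl("/home", FAKE_WEB))
# → ['/home', '/about', '/blog']
```

A real crawler would download each page over HTTP instead of reading it from a dictionary, but the visit-once queue logic stays the same.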
Let's crawl the Facebook website and perform some actions on it
I'm going to automate the Facebook website and perform a few actions. Please keep in mind that this post is purely for educational purposes. You need a Facebook account of your own to follow along, and a site you already know well may be a good place to start. If you run the script too often, Facebook may detect that the requests come from a bot rather than a human and flag them as spam, so run it only occasionally. The most important thing is to understand what this code is doing.
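One simple way to avoid running the script too often is to record when it last ran and refuse to start again too soon. This is only a sketch under my own assumptions: the function name may_run, the stamp file, and the one-hour interval are all hypothetical and not part of the script below.

```python
import tempfile
import time
from pathlib import Path


def may_run(stamp_file, min_interval_s, now=None):
    """Return True and record the run if enough time has passed since the last one."""
    now = time.time() if now is None else now
    path = Path(stamp_file)
    if path.exists():
        last_run = float(path.read_text())
        if now - last_run < min_interval_s:
            return False  # too soon; keep the old timestamp
    path.write_text(str(now))
    return True


# Demonstration with fixed timestamps in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    stamp = Path(tmp) / "last_run.txt"
    first = may_run(stamp, 3600, now=0.0)      # no previous run recorded
    second = may_run(stamp, 3600, now=600.0)   # only 10 minutes later
    third = may_run(stamp, 3600, now=4000.0)   # more than an hour later

print(first, second, third)
# → True False True
```

In a real script you would call may_run once at startup, without the now argument, and exit early when it returns False.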
Task:
- Log in to Facebook with a username and password.
- Go to the user profile.
- Post something on Facebook.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
options = webdriver.ChromeOptions()
svc = Service("./opt/chromedriver")
options.binary_location = "./opt/chrome-linux/chrome"
# options.add_argument("--headless")
options.add_experimental_option("detach", True)
options.add_argument("--window-size=1520,800")
options.add_argument("--disable-notifications")
driver = webdriver.Chrome(service=svc, options=options)
driver.implicitly_wait(10)
def login(username, passwd):
    try:
        driver.get('https://www.facebook.com')
        # fill in and submit the login form
        txtUsername = driver.find_element(By.ID, 'email')
        txtUsername.send_keys(username)
        txtPasswd = driver.find_element(By.ID, 'pass')
        txtPasswd.send_keys(passwd)
        btnLogin = driver.find_element(By.NAME, 'login')
        btnLogin.submit()
        # open the profile page (assumes the login username is also the profile handle)
        driver.get(f"https://www.facebook.com/{username}")
        # verify login
        # there must be 'Add to story' button after login
        driver.find_element(
            By.XPATH, "//a[@aria-label='Add to story']")
        return True
    except Exception as e:
        print("Unable to login. Please check your username or password once again.")
        print(e)
        return False
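def post_on_facebook(post_text):
    try:
        # find the "What's on your mind?" button and open the post dialog
        buttons = driver.find_elements(
            By.XPATH, "//div[@role='button']")
        for button in buttons:
            if button.text == "What's on your mind?":
                button.click()
                break  # stop looping once the dialog has been opened
        else:
            print("Unable to find the post box.")
            return
        # the text area of the dialog receives focus after the click
        activepostarea = driver.switch_to.active_element
        activepostarea.send_keys(post_text)
        postBtn = driver.find_element(
            By.XPATH, "//div[@aria-label='Post']")
        postBtn.click()
        print("Successfully posted on Facebook.")
    except Exception as e:
        print("Something went wrong. Unable to post on Facebook.")
        print(e)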
if __name__ == '__main__':
    username = input("Enter Facebook username: ")
    passwd = input("Enter Facebook password: ")
    isLogin = login(username, passwd)
    if isLogin:
        post_text = input("What do you want to post: ")
        post_on_facebook(post_text)
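One optional refinement: the script above reads the password with input(), which shows it on screen as you type. A possible alternative, sketched here, is the standard library's getpass. The helper name read_credentials and its two parameters are my own invention, added so the function can be tried without a keyboard.

```python
from getpass import getpass


def read_credentials(ask=input, ask_secret=getpass):
    """Prompt for a username and a password; the password is not echoed."""
    username = ask("Enter Facebook username: ").strip()
    passwd = ask_secret("Enter Facebook password: ")
    if not username or not passwd:
        raise ValueError("Username and password must not be empty.")
    return username, passwd
```

In the main block you could then write username, passwd = read_credentials() in place of the two input() calls.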
Let's explore the code mentioned above in more detail.
Step 1: Import required packages
Install Selenium with pip. Selenium is an open-source package for automating web browsers.
pip install selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
The 'By' class lists the strategies for locating elements: by ID, name, class name, tag name, CSS selector, XPath, and so on. 'Service' configures how the Chrome driver is started, for example by giving the path to the driver executable.
Step 2: Configuring Chrome driver
I will mention two ways to configure the Chrome driver.
- Install the Chrome driver at run time.
- Provide the Chrome driver path.
With the first method, the Chrome driver is downloaded at run time if it has not already been downloaded. This is the most straightforward option to set up, but it is not a viable solution if you deploy your code somewhere that makes it difficult to download anything at run time. For this method, you need to install webdriver-manager from pip.
pip install webdriver-manager
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(
    service=Service(ChromeDriverManager().install()), options=options)
The second method is to provide the Chrome driver path, as I have done in the sample code above. For this, you need to install Chromium and a Chrome driver manually, and the driver version must match the browser version.
To download Chromium for your operating system:
https://commondatastorage.googleapis.com/chromium-browser-snapshots/index.html
To download the Chrome driver for your operating system (this link points to version 114.0.5735.90):
https://chromedriver.storage.googleapis.com/index.html?path=114.0.5735.90/
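If you go with a fixed path, it can help to check that the driver file is really there before starting the browser, so that a wrong path fails with a clear message. This is only a sketch: find_driver is a hypothetical helper, and the second candidate path is just an example location, not something the original script uses.

```python
from pathlib import Path


def find_driver(candidates):
    """Return the first candidate path that is an existing file, else None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None


# "./opt/chromedriver" is the path from the script above;
# "/usr/local/bin/chromedriver" is only an example of a second place to look.
driver_path = find_driver(["./opt/chromedriver", "/usr/local/bin/chromedriver"])
if driver_path is None:
    print("No Chrome driver found; consider the run-time download instead.")
else:
    print(f"Using Chrome driver at {driver_path}")
```

When a path is found, you would pass str(driver_path) to Service(...). As far as I know, Selenium 4.6 and later also ship with Selenium Manager, which tries to resolve a driver automatically when no path is given, so that may be a third option worth checking for your version.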
Step 3: Headless browser
It's up to you whether to run your script headless or not. A headless browser runs without a graphical user interface.
To run the script in headless mode, add this argument to the Chrome options:
options.add_argument("--headless")
If you prefer a graphical user interface, simply remove this line or comment it out, as I have done above.
If you also want the browser to stay open after the script finishes, add the option below:
options.add_experimental_option("detach", True)
Step 4: Window size
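It is advisable to set the window size explicitly, even when you run your script in headless mode. A headless browser starts with a small default window, so a responsive site may serve its narrow, mobile-style layout, where the elements your script looks for can be arranged differently.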
options.add_argument("--window-size=1520,800")
With a visible browser window, we can also maximize it to full size.
driver.maximize_window()
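The options from Steps 3 and 4 can be gathered in one small function, so that switching between headless and windowed runs is a single parameter. This is a sketch of one possible arrangement; chrome_arguments and its parameters are names I made up, while the switches themselves are the ones used in the script above.

```python
def chrome_arguments(headless=False, width=1520, height=800):
    """Build the list of Chrome command-line switches used in this post."""
    args = [f"--window-size={width},{height}", "--disable-notifications"]
    if headless:
        args.append("--headless")
    return args


print(chrome_arguments(headless=True))
# → ['--window-size=1520,800', '--disable-notifications', '--headless']
```

To apply them, loop over the list and call options.add_argument(arg) for each entry before creating the driver.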
Step 5: Implicit or Explicit wait
Waits in Selenium keep our program running smoothly. Pages do not load instantly, so the driver needs some extra time to make sure everything has loaded before we look for elements. For this, there are primarily two approaches.
Both implicit and explicit waits make the Selenium WebDriver wait for up to a specified length of time before throwing an exception. The difference is that an implicit wait is set once and applies to every element lookup for the rest of the session, whereas an explicit wait is defined wherever it is needed and makes WebDriver wait until a specific condition is met.
I'm using an implicit wait in the above program. As a result, whenever the driver cannot find an element right away, it keeps retrying for up to 10 seconds before raising an exception.
driver.implicitly_wait(10)
Sample code for defining explicit wait:
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located(
    (By.ID, "datePicker")))
Which wait is better? Both are useful, and neither waits for the full timeout if its condition is met sooner. For a basic application, I believe the implicit wait is preferable because it is defined only once for the whole program, so we can concentrate on the rest of the code.
However, for a complex program, or one that may need to be extended later, an explicit wait is the better technique. You don't always want to wait the same length of time everywhere in the program, and you may need to take a specific action only when a wait times out. Speaking as a software engineer who currently works on web crawling, my opinion is that implicit wait is better for a basic program while explicit wait is better for a large one. Whichever you choose, the Selenium documentation advises against mixing the two in one script, because the combined timeouts can be unpredictable. The choice is completely up to you.
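To see why a wait returns early once its condition is met, and where you would react to a timeout, here is a conceptual sketch of the polling idea behind an explicit wait. It is plain Python and not Selenium's actual implementation: wait_until, element_appears, and the attempts counter are all hypothetical names, and the fake condition simply succeeds on its third check.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value as soon as it appears; raises TimeoutError otherwise.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # condition met: stop waiting immediately
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Hypothetical stand-in for "is the element on the page yet?"
attempts = {"count": 0}


def element_appears():
    attempts["count"] += 1
    return "datePicker" if attempts["count"] >= 3 else None


found = wait_until(element_appears, timeout=2.0, poll=0.01)
print(found, attempts["count"])
# → datePicker 3

timed_out = False
try:
    wait_until(lambda: None, timeout=0.05, poll=0.01)
except TimeoutError:
    # this is where you would take action only when the wait fails
    timed_out = True
print(timed_out)
# → True
```

With Selenium itself, the same pattern is wait.until(...) wrapped in a try block that catches TimeoutException from selenium.common.exceptions.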
Step 6: Facebook crawler
I have now covered the basic and most important concepts in Selenium, so it's time to discuss our main project.
Here I am automating the Facebook website. Before we can act on an element, we must work out how to locate it uniquely, for example by its ID, name, or class name.
driver.get('https://www.facebook.com')
txtUsername = driver.find_element(By.ID, 'email')
txtUsername.send_keys(username)
txtPasswd = driver.find_element(By.ID, 'pass')
txtPasswd.send_keys(passwd)
btnLogin = driver.find_element(By.NAME, 'login')
btnLogin.submit()
After navigating to facebook.com, I type the username and password into their input fields and then submit the login form.
The rest of the code follows the same pattern: locate an element, then act on it. After the script runs, the bot will have posted on Facebook.
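Before putting a locator into a script, it is worth checking that it matches exactly one element. The sketch below illustrates that check offline with the standard library's ElementTree rather than Selenium. The form markup is a simplified, hypothetical stand-in for the real Facebook login page, and count_matches is a helper I made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified login form; the real page markup is far larger.
FORM = """
<form>
  <input id="email" name="email" type="text" />
  <input id="pass" name="pass" type="password" />
  <button name="login" type="submit">Log in</button>
</form>
"""

root = ET.fromstring(FORM)


def count_matches(xpath):
    """How many elements in the form match this XPath expression?"""
    return len(root.findall(xpath))


for xpath in (".//input", ".//input[@id='email']", ".//*[@name='login']"):
    print(xpath, "->", count_matches(xpath))
# → .//input -> 2
# → .//input[@id='email'] -> 1
# → .//*[@name='login'] -> 1
```

The tag name alone matches two elements, so it is not a safe locator here, while the ID and the name each match exactly one. In the browser you can do a similar check by searching for the expression in the developer tools.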
Conclusion
This is how you can automate a task with Selenium. I've shown how you can log in to Facebook automatically and post there.
I hope this post is clear to you. Please leave feedback below if you have any questions, and I will reply as soon as possible. Thanks, mate.