A client approached me with a straightforward brief: they wanted to analyse engagement on a competitor's public Facebook page — specifically the comments people were leaving on posts and the overall performance of those posts over time. The goal was sentiment analysis and competitive intelligence, not anything invasive. The data was all publicly available. The problem was volume: manually copying hundreds of comments and posts was not an option.
My solution was two separate Python scrapers — one focused on extracting comments and threaded replies from a specific post, and one for scraping post-level data (text, likes, shares, comments count, content type) across an entire Facebook page. This article walks through both in detail.
What you will learn: How to automate Facebook login with Selenium, dynamically load comments and expand reply threads, parse the DOM with BeautifulSoup, detect content types (video, image, text), and export clean structured data to CSV — with real, production-tested code.
Tech Stack and Why Each Tool Was Chosen
Facebook is a JavaScript-heavy, dynamically rendered platform. A standard requests.get() call will not return the full DOM. You need a browser automation tool to render JavaScript before parsing. Here is what the stack looks like:
- Selenium: Drives the Microsoft Edge browser, handles login, scrolling, button clicks (expand replies, load more comments), and dynamic content rendering.
- BeautifulSoup: Parses the fully rendered HTML snapshot at each stage. Fast, reliable, and easier to query than Selenium's own element selection for complex nested structures.
- pandas: Structures extracted data into DataFrames and exports clean CSVs with proper UTF-8 encoding for international characters.
- Environment variables: Credentials are never hardcoded. `FB_EMAIL` and `FB_PASSWORD` are loaded from system environment variables, a basic but essential security practice.
Scraper 1: Facebook Comments & Replies
The first scraper targets a specific Facebook post URL and extracts every comment plus all threaded replies, along with author names, profile links, comment text, and reaction counts.
The challenge: dynamic loading
Facebook does not load all comments at once. It loads a subset, then requires user interaction — clicking "View more comments" and individual "X Replies" buttons — to reveal the rest. A scraper that only reads the initial page state will miss the vast majority of engagement data.
Initialise browser & login
Launch Edge with anti-detection flags. Navigate to facebook.com/login, enter credentials from environment variables, and wait for the home feed to load fully.
Navigate to target post & switch filter
Load the post URL, then switch the comment sort from "Most relevant" to "All comments" — this reveals chronological comments rather than only Facebook's algorithmically selected ones.
Scroll and load all comments
Continuously scroll to the bottom and click all "View more comments" / "View previous comments" buttons. Loop until the comment count is stable across 12 consecutive checks.
Expand all reply threads
Find every "X Replies" span element and click each one to expand threaded replies. Track already-clicked buttons to avoid infinite loops.
Parse HTML & export to CSV
Take a full HTML snapshot, parse with BeautifulSoup, separate comments from replies using ARIA labels, extract IDs from URLs, and write two deduplicated CSV files.
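The ARIA-label separation and ID extraction in this final step can be sketched as follows. The label wording and the `comment_id` URL parameter are assumptions based on Facebook's markup at the time of writing; treat the selectors as illustrative rather than stable.

```python
import re
from bs4 import BeautifulSoup

COMMENT_ID_RE = re.compile(r"comment_id=(\d+)")

def split_comments_and_replies(html):
    """Separate top-level comments from threaded replies via ARIA labels."""
    soup = BeautifulSoup(html, "html.parser")
    comments, replies = [], []
    for art in soup.find_all("div", role="article"):
        label = (art.get("aria-label") or "").lower()
        # Pull the numeric ID out of the comment permalink, if present
        link = art.find("a", href=COMMENT_ID_RE)
        match = COMMENT_ID_RE.search(link["href"]) if link else None
        record = {
            "id": match.group(1) if match else None,
            "text": art.get_text(" ", strip=True),
        }
        (replies if "reply" in label else comments).append(record)
    return comments, replies
```

The same `role="article"` container holds both comments and replies, so the ARIA label is the only cheap signal for threading; anything labelled "Reply…" is attached to its parent in a later pass.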
Output: what the comments scraper produces
| File | Field | Description |
|---|---|---|
| fb_comments.csv | comment_id | Unique ID extracted from comment URL |
| fb_comments.csv | author | Display name of the commenter |
| fb_comments.csv | profile_link | Facebook profile URL of commenter |
| fb_comments.csv | text | Full comment text |
| fb_comments.csv | reactions | Number of reactions on the comment |
| fb_replies.csv | reply_id | Unique ID of the reply |
| fb_replies.csv | parent_comment_id | ID of the parent comment (for threading) |
| fb_replies.csv | reply_author | Display name of the reply author |
| fb_replies.csv | text | Full reply text |
| fb_replies.csv | reactions | Reactions on the reply |
The comments scraper code
```python
# ── IMPORTS ──────────────────────────────────────────────────
import time, pandas as pd, re, os
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class FacebookScraper:
    def __init__(self, email, password):
        self.email = email
        self.password = password
        self.driver = None
        self.clicked_buttons = set()

    def initialize_driver(self):
        options = webdriver.EdgeOptions()
        # Suppress automation flags that trigger detection
        options.add_argument("--disable-blink-features=AutomationControlled")
        options.add_experimental_option("excludeSwitches", ["enable-automation"])
        self.driver = webdriver.Edge(options=options)
        self.driver.maximize_window()

    def login(self):
        self.driver.get("https://www.facebook.com/login")
        time.sleep(3)
        self.driver.find_element(By.NAME, "email").send_keys(self.email)
        self.driver.find_element(By.NAME, "pass").send_keys(self.password)
        self.driver.find_element(By.NAME, "login").click()
        time.sleep(15)  # Allow full page load + any 2FA prompts

    def click_all_comments_filter(self):
        """Switch from 'Most relevant' to 'All comments' sort"""
        wait = WebDriverWait(self.driver, 10)
        dropdown = wait.until(EC.element_to_be_clickable((
            By.XPATH,
            "//div[@role='button']//span[contains(text(), 'Most relevant')]"
        )))
        dropdown.click()
        time.sleep(2)
        all_comments = wait.until(EC.element_to_be_clickable((
            By.XPATH, "//span[text()='All comments']"
        )))
        all_comments.click()
        time.sleep(5)

    def load_content(self):
        """Scroll and click 'View more comments' until stable"""
        last_count, stable = 0, 0
        while stable < 12:
            self.driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.END)
            self.driver.execute_script("window.scrollBy(0, 2000);")
            time.sleep(3)
            btns = self.driver.find_elements(By.XPATH,
                "//span[contains(text(),'View more comments') or "
                "contains(text(),'View previous comments')]"
                "/ancestor::div[@role='button']")
            for b in btns:
                self.driver.execute_script("arguments[0].click();", b)
            curr = len(self.driver.find_elements(By.XPATH, "//div[@role='article']"))
            if curr > last_count:
                last_count, stable = curr, 0
            else:
                stable += 1

    def expand_replies(self):
        """Click every 'X Replies' button to expand threads"""
        for _ in range(30):
            btns = self.driver.find_elements(By.XPATH,
                "//span[contains(text(),'repl') and not(contains(text(),'Write'))]")
            clicked = 0
            for b in btns:
                if b.text not in self.clicked_buttons:
                    self.driver.execute_script("arguments[0].click();", b)
                    self.clicked_buttons.add(b.text)
                    clicked += 1
                    time.sleep(0.5)
            if clicked == 0:
                break

    def parse_and_save(self):
        soup = BeautifulSoup(self.driver.page_source, "html.parser")
        comments_list, replies_list = [], []
        for art in soup.find_all("div", role="article"):
            # Distinguish comment vs reply via ARIA label
            is_reply = "reply" in art.get("aria-label", "").lower()
            # ... (extraction logic) ...
        pd.DataFrame(comments_list).drop_duplicates().to_csv(
            "fb_comments.csv", index=False, encoding="utf-8-sig")
        pd.DataFrame(replies_list).drop_duplicates().to_csv(
            "fb_replies.csv", index=False, encoding="utf-8-sig")

# ── USAGE ────────────────────────────────────────────────────
if __name__ == "__main__":
    EMAIL = os.getenv("FB_EMAIL")  # Never hardcode credentials
    PASSWORD = os.getenv("FB_PASSWORD")
    scraper = FacebookScraper(EMAIL, PASSWORD)
    try:
        scraper.initialize_driver()
        scraper.login()
        scraper.navigate_to_post("https://www.facebook.com/your-target-post")
        scraper.click_all_comments_filter()
        scraper.load_content()
        scraper.expand_replies()
        scraper.parse_and_save()
    finally:
        scraper.driver.quit()
```
Key design decision: Credentials are loaded via os.getenv() rather than hardcoded. Set FB_EMAIL and FB_PASSWORD as system environment variables before running. This prevents credentials from accidentally appearing in version control or logs.
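A minimal helper for that credential loading, failing fast with a clear message when the variables are missing (the error wording is my own):

```python
import os

def load_credentials():
    """Read FB_EMAIL / FB_PASSWORD from the environment; fail fast if absent."""
    email = os.getenv("FB_EMAIL")
    password = os.getenv("FB_PASSWORD")
    if not email or not password:
        raise RuntimeError(
            "Set FB_EMAIL and FB_PASSWORD as environment variables "
            "before running the scraper."
        )
    return email, password
```

Failing at startup beats a silent `None` being typed into the login form fifteen seconds later.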
Scraper 2: Facebook Page Posts
The second scraper targets a Facebook page profile and extracts post-level data across a configurable number of posts. Rather than diving deep into a single post, it sweeps across the feed to capture metadata: what was posted, how it performed, and what type of content it was.
The challenge: disappearing posts
Facebook's feed has a quirk that makes it more complex than a typical paginated site: as you scroll down and new posts load, older posts disappear from the DOM to manage memory. This means you cannot scroll to the bottom of the page and parse everything at once. You need to extract data incrementally — scroll a little, parse, scroll more, parse again, deduplicate.
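Because the same post can be parsed in several successive snapshots, the loop depends on a stable deduplication key. A sketch of that step, assuming `post_link` (with a text-prefix fallback for posts whose link was not found) is unique enough in practice:

```python
def remove_duplicates(posts):
    """Deduplicate posts collected across overlapping DOM snapshots."""
    seen, unique = set(), []
    for post in posts:
        # Prefer the permalink; fall back to a text fingerprint
        key = post.get("post_link") or post.get("post_text", "")[:80]
        if key and key not in seen:
            seen.add(key)
            unique.append(post)
    return unique
```

Order is preserved, so the first (usually most complete) capture of each post wins.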
Content type detection
One of the more interesting engineering decisions in this scraper is the robust content type detection. Facebook does not give you a clean field saying "this is a video." You have to infer it from the DOM. The detect_content_type method works through a priority cascade:
- Check for a direct `<video>` tag, the clearest signal
- Scan all tag attributes for the string "video" or "playable"
- Look for play button overlays via ARIA labels, titles, or CSS class names
- Check `role="presentation"` wrappers (Facebook's standard video container)
- Fall back to `<img>` detection for image posts
- Default to "text" if none of the above match
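A self-contained sketch of that cascade. The selector details (especially the play-overlay check) are assumptions; Facebook's real markup varies by locale and rollout:

```python
from bs4 import BeautifulSoup

def detect_content_type(post):
    """Priority cascade: video tag > video-ish attributes > play overlay > image > text."""
    if post.find("video"):
        return "video"
    # Any attribute value mentioning video/playable is a strong hint
    for tag in post.find_all(attrs=True):
        for value in tag.attrs.values():
            if isinstance(value, str) and (
                "video" in value.lower() or "playable" in value.lower()
            ):
                return "video"
    # Play-button overlays usually surface in ARIA labels
    if post.find(attrs={"aria-label": lambda v: v and "play" in v.lower()}):
        return "video"
    if post.find("img"):
        return "image"
    return "text"
```

The ordering matters: an image thumbnail inside a video post would misclassify as "image" if the `<img>` check ran first.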
Important: Facebook updates its DOM structure frequently. Class names like `x1n2onr6` and the CSS selectors used here were valid at the time of writing but may change. Web scrapers targeting dynamic platforms require periodic maintenance. Build in monitoring and alerts if you rely on this in production.
Output: what the posts scraper produces
| File | Field | Description |
|---|---|---|
| facebook_posts.csv | post_text | Full text content of the post |
| facebook_posts.csv | likes | Number of reactions on the post |
| facebook_posts.csv | comments | Comment count displayed on the post |
| facebook_posts.csv | shares | Share count displayed on the post |
| facebook_posts.csv | post_time | Relative timestamp (e.g. "3 hours ago") |
| facebook_posts.csv | content_type | Inferred type: video, image, or text |
| facebook_posts.csv | post_link | Direct URL to the post |
The posts scraper code
```python
import time, random, os
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

class FacebookScraper:
    def simulate_human_typing(self, element, text):
        """Randomised keypress timing to avoid bot detection"""
        for char in text:
            element.send_keys(char)
            time.sleep(random.uniform(0.1, 0.3))
            if random.random() < 0.1:
                time.sleep(random.uniform(0.3, 0.7))

    def detect_content_type(self, post):
        """Priority cascade: video > play overlay > image > text"""
        if post.find("video"):
            return "video"
        # Scan attributes for video/playable keywords
        for tag in post.find_all(attrs=True):
            for k, v in tag.attrs.items():
                if isinstance(v, str) and ('video' in v.lower()
                                           or 'playable' in v.lower()):
                    return "video"
        if post.find("img"):
            return "image"
        return "text"

    def extract_posts_with_bs(self):
        """Parse current DOM snapshot for post metadata"""
        soup = BeautifulSoup(self.driver.page_source, "html.parser")
        posts_data = []
        posts = soup.find_all("div", {"class": "x1n2onr6 x1ja2u2z"})
        for post in posts:
            post_text = " ".join(
                msg.get_text(strip=True)
                for msg in post.find_all("div", {"data-ad-preview": "message"})
            )
            # Extract post link (posts, videos, reels, photos, stories)
            post_link = None
            for a in post.find_all("a", href=True):
                if any(p in a["href"] for p in
                       ["/posts/", "/videos/", "/reel/", "/photo/"]):
                    url = a["href"].split("?")[0]
                    post_link = ("https://www.facebook.com" + url
                                 if url.startswith("/") else url)
                    break
            posts_data.append({
                "post_text": post_text,
                "content_type": self.detect_content_type(post),
                "post_link": post_link,
                # + likes, comments, shares, post_time
            })
        return posts_data

    def scrape_posts(self, max_posts):
        """Scroll-parse-deduplicate loop until target count reached"""
        all_posts = []
        while len(all_posts) < max_posts:
            posts = self.extract_posts_with_bs()
            all_posts.extend(posts)
            all_posts = self.remove_duplicates(all_posts)
            self.slow_scroll()
        return all_posts[:max_posts]

# ── USAGE ────────────────────────────────────────────────────
if __name__ == "__main__":
    # Credentials from environment variables, as in the comments scraper
    scraper = FacebookScraper(os.getenv("FB_EMAIL"), os.getenv("FB_PASSWORD"))
    try:
        scraper.initialize_driver()
        scraper.login()
        scraper.navigate_to_profile("https://www.facebook.com/targetpage")
        df = scraper.save_to_csv(scraper.scrape_posts(max_posts=600))
    finally:
        scraper.close()
```
What the Client Used This Data For
The two CSV outputs fed directly into a downstream analysis pipeline. The comments data was used for sentiment analysis — understanding how the audience felt about different post types and topics. The post-level data revealed which content formats (video vs image vs text) were generating the highest engagement, and at what times of day. Together, they gave the client a detailed picture of their competitor's content strategy that would have been impossible to build manually.
Anti-Detection: What Keeps the Scraper Running
Facebook has active bot detection. Here are the techniques built into both scrapers to reduce detection risk:
- AutomationControlled flag disabled: The `--disable-blink-features=AutomationControlled` argument removes the JavaScript `navigator.webdriver` property that websites use to detect automated browsers.
- Human-like typing: The posts scraper uses `simulate_human_typing()` with randomised delays between keystrokes, mimicking the irregular rhythm of human input rather than instant field filling.
- Sleep intervals throughout: Deliberate `time.sleep()` calls after every major action (login, navigation, filter switching, button clicks) prevent the rapid-fire request patterns that trigger rate limiting.
- JavaScript-based clicks: Buttons like "View more comments" are clicked via `execute_script("arguments[0].click()")` rather than Selenium's `.click()` method, which is more reliable on dynamically rendered elements and less detectable.
- Stability checks: The comment loader uses a stability counter rather than a fixed scroll count, ensuring the scraper only stops when the content has genuinely stopped loading, not just when it ran out of iterations.
Rate limiting matters more than anything else. If you exceed roughly a hundred requests per hour, Facebook slows down responses significantly. Both scrapers are built with conservative sleep timings. If you plan to run at higher volume, reduce concurrency and increase sleep intervals rather than trying to push through rate limits.
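One way to enforce that ceiling mechanically is a sliding-window budget. The ~100/hour figure comes from the observation above; the implementation below is my own sketch, not part of the original scrapers:

```python
import time
from collections import deque

class RateLimiter:
    """Block until an action fits within a per-hour budget (sliding window)."""

    def __init__(self, max_per_hour=100):
        self.max_per_hour = max_per_hour
        self.stamps = deque()  # monotonic timestamps of recent actions

    def wait(self):
        now = time.monotonic()
        # Drop timestamps older than one hour
        while self.stamps and now - self.stamps[0] >= 3600:
            self.stamps.popleft()
        if len(self.stamps) >= self.max_per_hour:
            # Sleep until the oldest action ages out of the window
            time.sleep(3600 - (now - self.stamps[0]))
            self.stamps.popleft()
        self.stamps.append(time.monotonic())
```

Calling `limiter.wait()` before each navigation or click keeps the scraper under the budget without hand-tuning every individual sleep.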
Frequently Asked Questions
Can you scrape Facebook with Python and Selenium?
Yes. Facebook is a JavaScript-heavy platform that requires a browser automation tool like Selenium to render dynamic content before parsing it with BeautifulSoup. Simple requests-based scrapers will not work because the DOM is not fully loaded without JavaScript execution.
Is it legal to scrape Facebook data?
Scraping publicly available data is generally considered legal in many jurisdictions. In 2022, the Ninth Circuit's ruling in hiQ Labs v. LinkedIn affirmed that scraping publicly accessible data does not violate the Computer Fraud and Abuse Act. However, it may still violate Facebook's Terms of Service, which can result in account bans. Always consult a legal professional for your specific use case and only scrape publicly available data.
How do I scrape Facebook comments and replies?
You need Selenium to log in, scroll through the post, click "View more comments" and "View replies" buttons dynamically, then use BeautifulSoup to parse the fully loaded HTML. The key challenge is correctly identifying comment articles versus reply articles using ARIA labels in the DOM, and building a stability loop to confirm all content has loaded.
What data can you extract from a Facebook post scraper?
From Facebook page posts you can extract: post text, number of likes, comments count, shares count, time posted, content type (video, image, or text), and the direct post link. This data is useful for sentiment analysis, competitor monitoring, content strategy research, and social media audits.
How do I avoid getting blocked when scraping Facebook?
Key strategies include: disabling Selenium's automation control flags, using randomised typing delays, adding generous sleep intervals between actions, using JavaScript-based element clicks, and keeping request volume below roughly 100 per hour. Residential proxies can further reduce detection risk for high-volume scraping.