BeautifulSoup

Focused BeautifulSoup recipes: reading attributes, element text, nested finds, find_all, CSS selectors, and stripped_strings.

BeautifulSoup — read a tag attribute (CSRF token, id, href)

Finds a tag by one attribute, then indexes it like a dict to read another attribute. The core technique for grabbing a CSRF token or object id before replaying a request.

find(tag, {attr: val}) returns the first element matching that attribute, or None when nothing matches; indexing the result like a dict (["value"], ["href"]) reads any attribute off it. This is how a CSRF token, a hidden object id, or a download link is pulled out of a page before replaying it in the next request. Chaining .strip() / .split() cleans the value.

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, "html.parser")

csrf = soup.find("input", {"id": "csrf"})["value"]            # by id
name = soup.find("input", {"name": "csrf"})["value"]          # by name attribute
href = soup.find("a", {"target": "_blank"})["href"]           # read the href
fid  = soup.find("button", {"class": "delete-btn"})["value"].strip()

Markup these calls target

<input id="csrf" name="csrf" value="9f8a1c">
<a target="_blank" href="/files/report.pdf">Open</a>
<button class="delete-btn" value="42">Delete</button>

What each variable holds

csrf -> "9f8a1c"
name -> "9f8a1c"
href -> "/files/report.pdf"
fid  -> "42"
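
The one-liners above fail loudly if the page changes: find() returns None when nothing matches, so indexing it raises TypeError, and indexing an attribute the tag lacks raises KeyError. A minimal defensive sketch, assuming a hypothetical helper named read_attr and an html string standing in for r.text:

```python
from bs4 import BeautifulSoup

# Hypothetical page body standing in for r.text
html = '<form><input id="csrf" name="csrf" value=" 9f8a1c "></form>'
soup = BeautifulSoup(html, "html.parser")

def read_attr(soup, tag, attrs, attr_name):
    """Return one attribute of the first matching tag, or None if absent."""
    el = soup.find(tag, attrs)       # None when nothing matches
    if el is None:
        return None
    value = el.get(attr_name)        # .get() returns None instead of raising KeyError
    # Multi-valued attributes such as class come back as a list; only strip strings
    return value.strip() if isinstance(value, str) else value

token = read_attr(soup, "input", {"id": "csrf"}, "value")
no_tag = read_attr(soup, "input", {"id": "nope"}, "value")
no_attr = read_attr(soup, "input", {"id": "csrf"}, "href")
print(token, no_tag, no_attr)
```

Returning None lets the calling script decide whether to retry, log in again, or abort, rather than dying mid-loop on a traceback.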

_{Find by: beautifulsoup, bs4, attribute, value, csrf token, hidden input, href, find by id, find by name, find by class, scrape token, grab id · Source: PG/Monster, WSA, PG/Zipper, PG/WallpaperHub}

BeautifulSoup — read element text (command output, reflected value)

Pulls the text out of an element with get_text().strip(); split/cast slices a value out of a label like “Balance: $250”.

find() returns the first matching element, and .get_text() returns all the text inside it, including the text of any nested tags; .strip() trims surrounding whitespace. This is how a command’s output, a reflected input, or a status message is read back out of the response. When the value is wrapped in a label (Balance: $250), .split(": ")[1].strip("$") drops the label and int(...) makes it usable in arithmetic, which is exactly what a balance/race loop needs.

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, "html.parser")

output = soup.find("span").get_text()                          # raw text of first <span>
result = soup.find("div", {"class": "divmin"}).get_text().strip()   # by class, trimmed
heading = soup.find("h4").get_text().strip()

# value embedded in a label -> split off the label, cast to a number
balance = int(soup.find("strong").get_text().split(": ")[1].strip("$"))

Markup these calls target

<span>uid=33(www-data) gid=33(www-data)</span>
<div class="divmin">  root:x:0:0:root:/root:/bin/bash  </div>
<h4>config.php</h4>
<strong>Balance: $250</strong>

What each variable holds

output  -> "uid=33(www-data) gid=33(www-data)"
result  -> "root:x:0:0:root:/root:/bin/bash"
heading -> "config.php"
balance -> 250            (an int, ready for arithmetic)
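
The split/strip chain above breaks as soon as the label changes shape or the number gains a thousands separator. A hedged alternative is to pull the number out with a regular expression; the markup and the pattern below are illustrative assumptions, not taken from any particular target:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical response body with a thousands separator in the amount
html = "<strong>Balance: $1,250</strong>"
soup = BeautifulSoup(html, "html.parser")

text = soup.find("strong").get_text(strip=True)    # "Balance: $1,250"

# Capture the digits (and commas) that follow the dollar sign
match = re.search(r"\$([\d,]+)", text)
balance = int(match.group(1).replace(",", "")) if match else None
print(balance)
```

When the pattern does not match, balance is None, so the caller can tell "no balance on the page" apart from a balance of zero.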

_{Find by: beautifulsoup, bs4, get_text, element text, inner text, command output, reflected value, read response, strip, parse number, split value · Source: PG/XposedAPI, CWEE/Prototype Pollution, CWEE/Second Order, CWEE/Gift Card}

BeautifulSoup — drill into a nested element (chained find)

Chains find() to narrow to a container, then searches inside it. Applies when the target is only unique within a parent; class_= is the keyword form of the class filter.

find() can be chained: the first call returns a container element, and calling .find() on that result searches only inside it. The class_= keyword is a Python-friendly alias for {"class": ...} (because class is a reserved word). This applies when the target element (<p>) is not unique on the page but is unique inside a known parent (div.card-content).

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, "html.parser")

# first find() narrows to the container, the second searches inside it
content = soup.find("div", class_="card-content").find("p").get_text().strip()

Markup this targets

<div class="card-content">
  <h5>report.txt</h5>
  <p>SECRET{nested_value}</p>
</div>

Result

content -> "SECRET{nested_value}"
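
To see why the scoping matters, the sketch below adds a hypothetical earlier <p> to the page: an unscoped find("p") returns that one, while the chained form still reaches the intended element. The None checks are an added precaution, since a missing container would otherwise raise AttributeError on the second find().

```python
from bs4 import BeautifulSoup

# Hypothetical page: an unrelated <p> appears before the card
html = """
<p>Welcome back</p>
<div class="card-content">
  <h5>report.txt</h5>
  <p>SECRET{nested_value}</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Unscoped: first <p> anywhere in the document
unscoped = soup.find("p").get_text(strip=True)

# Scoped: narrow to the container first, guarding each step against None
card = soup.find("div", class_="card-content")
para = card.find("p") if card is not None else None
scoped = para.get_text(strip=True) if para is not None else None

print(unscoped)
print(scoped)
```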

_{Find by: beautifulsoup, bs4, nested find, chained find, class_, find within, parent child, drill into, container, scoped search · Source: CWEE/Second Order LFI}

BeautifulSoup — find_all then pick by index

find_all returns every match as a list; indexing the needed occurrence ([1], [-1]) applies when only the Nth copy of a repeated tag holds the data.

find_all(tag) returns a list of every matching element (unlike find, which returns only the first). Indexing it ([1], [-1]) applies when a page repeats a tag and only one position carries the needed value — a common shape for an in-band XPath or SQLi dump where the injected row lands at a fixed offset. .get_text(strip=True) is then called on the chosen element.

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, "html.parser")

second = soup.find_all("center")[1].get_text()        # 2nd <center>
last   = soup.find_all("td")[-1].get_text(strip=True)  # last <td>

Markup this targets

<center>Header</center>
<center>[email protected]</center>
<table><tr><td>id</td><td>0042</td></tr></table>

What each variable holds

second -> "[email protected]"
last   -> "0042"
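
Indexing a find_all() result raises IndexError when the page returns fewer matches than expected, which is common when an injection fails and the extra row never appears. A small sketch with a hypothetical helper, nth_text, that returns None in that case; the markup is illustrative:

```python
from bs4 import BeautifulSoup

# Hypothetical response: only two <center> tags
html = "<center>Header</center><center> row-value </center>"
soup = BeautifulSoup(html, "html.parser")

def nth_text(soup, tag, index):
    """Text of the index-th matching tag, or None if there are too few matches."""
    matches = soup.find_all(tag)
    try:
        return matches[index].get_text(strip=True)
    except IndexError:
        return None

second = nth_text(soup, "center", 1)
last = nth_text(soup, "center", -1)
absent = nth_text(soup, "center", 5)
print(second, last, absent)
```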

_{Find by: beautifulsoup, bs4, find_all, index, nth match, second element, last element, list of elements, td, center, in band dump · Source: CWEE/XPath in-band}

BeautifulSoup — CSS selectors with select() / select_one()

select() takes any CSS selector and returns a list; select_one() returns the first. Applies when an attribute filter is not expressive enough (combinators, descendants, ids).

select() accepts a full CSS selector (a.btn, section.list > div, #main p) and returns a list of matches; select_one() returns just the first. The > combinator means direct child. This is the cleanest way to scrape a grid or table when the rows are identified by their position in the DOM rather than a single attribute.

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, "html.parser")

tiles = soup.select("section.container-list-tiles > div")  # direct children -> list
first = soup.select_one("section.container-list-tiles > div")
names = [d.get_text(strip=True) for d in tiles]

Markup this targets

<section class="container-list-tiles">
  <div>Laptop</div>
  <div>Mouse</div>
</section>

What each variable holds

tiles -> [<div>Laptop</div>, <div>Mouse</div>]   (2 elements)
first -> <div>Laptop</div>
names -> ["Laptop", "Mouse"]
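
The difference between the child combinator (>) and a plain descendant selector (a space) only shows up once tiles contain nested tags of the same name. The sketch below uses hypothetical markup with one nested <div> to illustrate it, and also shows that select_one() returns None when nothing matches:

```python
from bs4 import BeautifulSoup

# Hypothetical grid where the first tile wraps another <div>
html = """
<section class="container-list-tiles">
  <div>Laptop<div>nested detail</div></div>
  <div>Mouse</div>
</section>
"""
soup = BeautifulSoup(html, "html.parser")

direct = soup.select("section.container-list-tiles > div")   # direct children only
nested = soup.select("section.container-list-tiles div")     # any depth
missing = soup.select_one("section.no-such-class > div")     # no match -> None

print(len(direct), len(nested), missing)
```

With the child combinator, each list entry corresponds to exactly one tile, so positions in the list line up with positions in the grid.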

_{Find by: beautifulsoup, bs4, css selector, select, select_one, direct child, combinator, class selector, query, list of nodes, scrape grid · Source: WSA SQLi in-band}

BeautifulSoup — collect a whole column (find_all + comprehension)

find_all on the repeated tag with a get_text(strip=True) comprehension scrapes an entire column at once — the backbone of a boolean/diff oracle.

Scraping a whole column — every product name, every row label — uses find_all on the repeated tag and builds a list with a comprehension over get_text(strip=True). strip=True trims whitespace per element so comparisons are exact. Diffing this list between a baseline request and an injected one is the core of a boolean / content-based SQLi oracle.

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, "html.parser")

# one-liner: every <h3>'s text as a clean list
titles = [t.get_text(strip=True) for t in soup.find_all("h3")]

# expanded equivalent
titles = []
for t in soup.find_all("h3"):
    titles.append(t.get_text(strip=True))

Markup this targets

<h3>Laptop</h3>
<h3>Mouse</h3>
<h3>Keyboard</h3>

Result

titles -> ["Laptop", "Mouse", "Keyboard"]
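
The baseline-versus-injected comparison mentioned above can be sketched as follows. Both HTML strings are hypothetical stand-ins for two responses, and which outcome counts as "true" is an assumption that depends on the target; here an identical column is treated as the true case.

```python
from bs4 import BeautifulSoup

def titles(html):
    """Every <h3>'s text as a clean list."""
    soup = BeautifulSoup(html, "html.parser")
    return [t.get_text(strip=True) for t in soup.find_all("h3")]

# Hypothetical responses: the injected request returns fewer rows
baseline_html = "<h3>Laptop</h3><h3>Mouse</h3><h3>Keyboard</h3>"
injected_html = "<h3> Laptop </h3>"

baseline = titles(baseline_html)
injected = titles(injected_html)

# Assumed convention: same column as the baseline means the condition held
condition_true = injected == baseline
dropped = [t for t in baseline if t not in injected]

print(condition_true, dropped)
```

Comparing the parsed lists rather than the raw response bodies ignores incidental differences such as timestamps or whitespace elsewhere on the page.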

_{Find by: beautifulsoup, bs4, find_all, list comprehension, collect column, scrape all, get_text strip, all matches, loop, build list, oracle list, diff results · Source: WSA SQLi in-band}

BeautifulSoup — separate text fragments with stripped_strings

When one element holds several separate text fragments, stripped_strings yields each one trimmed — unlike get_text() which concatenates them. Filtering isolates the needed piece.

When a single element wraps several distinct text fragments (a name, a price, a label), .get_text() glues them into one string. .stripped_strings instead yields each fragment as its own whitespace-trimmed string, so the generator can be filtered (if t.startswith("$")) to pick out exactly the needed piece.

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, "html.parser")

tile  = soup.find("div", {"class": "tile"})
parts = list(tile.stripped_strings)                         # each fragment, trimmed
price = [t for t in tile.stripped_strings if t.startswith("$")][0]   # isolate one

Markup this targets

<div class="tile">
  <h3>Laptop</h3>
  <span>$1,299</span>
  <small>in stock</small>
</div>

What each variable holds

parts -> ["Laptop", "$1,299", "in stock"]
price -> "$1,299"
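
For comparison, the sketch below runs get_text() and stripped_strings on the same tile markup, then picks the price with next() and a default so that a tile without a price yields None instead of IndexError. The final cast to int is an added illustration, not part of the recipe above.

```python
from bs4 import BeautifulSoup

html = """
<div class="tile">
  <h3>Laptop</h3>
  <span>$1,299</span>
  <small>in stock</small>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
tile = soup.find("div", class_="tile")

glued = tile.get_text(strip=True)        # fragments joined with no separator
parts = list(tile.stripped_strings)      # one trimmed string per fragment

# First fragment that looks like a price, or None if there is none
price = next((t for t in parts if t.startswith("$")), None)
amount = int(price.lstrip("$").replace(",", "")) if price else None

print(glued)
print(parts, price, amount)
```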

_{Find by: beautifulsoup, bs4, stripped_strings, text nodes, fragments, multiple texts, generator, filter text, price, name and price, split element text · Source: WSA SQLi in-band}

HTTP Parsing