Channel: Hot Weekly Questions - Web Applications Stack Exchange

Finding all pages under a given website

For a given website, I'm looking to find all pages under that website. For example, given "maryland.gov", I'm effectively looking for everything of the form "*.maryland.gov/*" (including "maryland.gov/*" pages). I'm following the outline from https://archive.org/post/1055220/how-to-query-for-all-the-websites-that-end-in-combr, but I've noticed that some pages appear in the results only some of the time, depending on whether or not I use pagination.

If I conduct a search with http://web.archive.org/cdx/search/cdx?url=maryland.gov&output=json&from=2010&matchType=domain&collapse=urlkey&filter=statuscode:200, then the page https://roads.maryland.gov/oppen/signal_systems17.pdf is caught, while the page https://abetter.maryland.gov/plan/documents/a-better-maryland-plan.pdf is not.

In contrast, if I use a search with pagination, such as http://web.archive.org/cdx/search/cdx?url=maryland.gov&output=json&from=2010&matchType=domain&collapse=urlkey&filter=statuscode:200&pageSize=1&page=0 (and then loop over the page field), then the reverse situation occurs where https://abetter.maryland.gov/plan/documents/a-better-maryland-plan.pdf is found, but https://roads.maryland.gov/oppen/signal_systems17.pdf is not.
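For reference, the paginated loop described above can be sketched roughly as follows. This is only a minimal sketch, not the exact script used: the helper names (`build_cdx_url`, `rows_to_dicts`, `iter_captures`) are hypothetical, and it assumes that the page count can be read with `showNumPages=true` as a bare integer and that `output=json` returns a header row followed by data rows.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_url(url, **overrides):
    """Build a CDX query URL using the same parameters as in the question."""
    params = {
        "url": url,
        "output": "json",
        "from": "2010",
        "matchType": "domain",
        "collapse": "urlkey",
        "filter": "statuscode:200",
    }
    params.update(overrides)
    # A value of None removes a default parameter (e.g. output=None).
    params = {k: v for k, v in params.items() if v is not None}
    return CDX_ENDPOINT + "?" + urlencode(params)


def rows_to_dicts(payload):
    """Assumption: the first JSON row is the header; map the rest to dicts."""
    if not payload:
        return []
    header, *rows = payload
    return [dict(zip(header, row)) for row in rows]


def _fetch_text(request_url):
    """Default fetcher: plain HTTP GET returning the response body as text."""
    with urlopen(request_url, timeout=60) as resp:
        return resp.read().decode("utf-8")


def iter_captures(url, fetch=_fetch_text, **overrides):
    """Yield one dict per capture, walking page=0, 1, ... in order."""
    # Ask for the page count first (assumed to come back as a bare integer).
    count_url = build_cdx_url(url, showNumPages="true", output=None, **overrides)
    num_pages = int(fetch(count_url).strip())
    for page in range(num_pages):
        text = fetch(build_cdx_url(url, page=page, **overrides)).strip()
        payload = json.loads(text) if text else []
        yield from rows_to_dicts(payload)
```

A call such as `list(iter_captures("maryland.gov", pageSize=1))` would then issue one request per page. Passing a different `fetch` function makes it possible to try the loop without hitting the archive.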

So, one workaround that I considered was doing an initial search to find the subdomains, and then doing a "matchType=prefix" search of each subdomain. However, the captures returned vary. For example, suppose I am interested in the captures of https://governor.maryland.gov/wp-content/uploads/2015/10/Minutes-Towson-Finale.pdf. Based on https://web.archive.org/web/20220000000000*/governor.maryland.gov/wp-content/uploads/2015/10/Minutes-Towson-Finale.pdf, there should be 53 matches. If I search with https://web.archive.org/cdx/search/cdx?url=governor.maryland.gov&output=json&matchType=prefix&from=2010, no matches are found. However, a search of https://web.archive.org/cdx/search/cdx?url=governor.maryland.gov/wp-content&output=json&matchType=prefix&from=2010 finds all 53 captures.
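To make the workaround concrete, here is a rough sketch of its two steps: collecting the distinct hosts from the captured URLs of an initial domain search, then building one "matchType=prefix" query per host. The function names (`hosts_from_captures`, `prefix_query_url`) are hypothetical, and the sketch only builds the query URLs; it does not fetch anything.

```python
from urllib.parse import urlencode, urlparse


def hosts_from_captures(originals):
    """Return the sorted set of hostnames found in captured URLs."""
    hosts = set()
    for original in originals:
        # urlparse only recognises the host when a scheme is present.
        if "://" not in original:
            original = "http://" + original
        host = urlparse(original).hostname  # lowercased, port stripped
        if host:
            hosts.add(host)
    return sorted(hosts)


def prefix_query_url(host, start="2010"):
    """Build the per-subdomain prefix query used in the workaround."""
    params = {
        "url": host,
        "output": "json",
        "matchType": "prefix",
        "from": start,
    }
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)


if __name__ == "__main__":
    captured = [
        "https://roads.maryland.gov/oppen/signal_systems17.pdf",
        "https://abetter.maryland.gov/plan/documents/a-better-maryland-plan.pdf",
    ]
    for host in hosts_from_captures(captured):
        print(prefix_query_url(host))
```

The same `prefix_query_url` helper also accepts a host plus path, such as "governor.maryland.gov/wp-content", although the slash is then percent-encoded in the resulting URL.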

My first question is: why do these differences occur? Also, I imagine that finding all paths and then doing a "matchType=prefix" search on each would eventually find all pages, but is there a more efficient way to get all of the pages?

