I need to understand this in order to legally scrape Wikipedia/Wikimedia data from their so-called pagecounts dumps.
While I am writing software, this is not a programming question; it is a question about the file format. I believe that my question belongs here, but if you believe it does not, please direct me to the right Stack Exchange site, or to a mailing list or something similar.
The files are available here: https://dumps.wikimedia.org/other/pagecounts-ez/merged/
The sample file that I looked at is: https://dumps.wikimedia.org/other/pagecounts-ez/merged/2019/2019-01/pagecounts-2019-01-01.bz2
It is pretty big, around 400 MB when compressed. If you decompress (un-bz2) it, you will see this comment at the top of the file:
# Wikimedia page request counts for 01/01/2019 (dd/mm/yyyy)
#
# Each line shows 'project page daily-total hourly-counts'
#
# Project is 'language-code project-code'
#
# Project-code is
#
# b:wikibooks,
# k:wiktionary,
# n:wikinews,
# q:wikiquote,
# s:wikisource,
# v:wikiversity,
# wo:wikivoyage,
# z:wikipedia (z added by merge script: ...
For example, the line 28 aa.b MediaWiki:Ipb_already_blocked 3 B1M1X1
is meant to convey the following (unless I made a mistake): language_code=aa, project_code=b:wikibooks, page=MediaWiki:Ipb_already_blocked, daily_total=3, 1-2AM=1, noon-1PM=1, 11PM-midnight=1.
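To make that reading concrete, here is a minimal Python sketch of how such a record could be decoded. It assumes that the leading 28 is a line number rather than part of the record, that the language code is everything before the first dot, and that the hour letters A..X map to hours 0..23 (which matches B=1, M=12, X=23 in the example). The helper name parse_line is made up for illustration.

```python
import re
import string

# Assumption: letters A..X stand for hours 0..23, each followed by a count.
HOUR_LETTERS = string.ascii_uppercase[:24]
HOURLY_RE = re.compile(r"([A-X])(\d+)")


def parse_line(line):
    """Split one 'project page daily-total hourly-counts' record (hypothetical helper)."""
    # Raises ValueError if the line does not have exactly four space-separated fields.
    project, page, daily_total, hourly = line.split()
    # Assumption: the language code is the part before the first dot,
    # and the project code is whatever follows it (e.g. 'b' or 'm.d').
    language_code, _, project_code = project.partition(".")
    hourly_counts = {
        HOUR_LETTERS.index(letter): int(count)
        for letter, count in HOURLY_RE.findall(hourly)
    }
    return language_code, project_code, page, int(daily_total), hourly_counts


print(parse_line("aa.b MediaWiki:Ipb_already_blocked 3 B1M1X1"))
# ('aa', 'b', 'MediaWiki:Ipb_already_blocked', 3, {1: 1, 12: 1, 23: 1})
```

The hourly counts come back as a dictionary keyed by hour (0 to 23), so hours with no requests are simply absent.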
However, the above description of the file format is incomplete. I counted all the project codes in this file and, in addition to b, k, n, q, s, v, wo and z, I encountered: m, m.d, d, wd, m.m, m.s, m.q, m.b, voy, m.voy, w, m.v, m.n, y and zero. Full list:
Counter({'z': 20230021, 'm': 18184241, 'm.d': 1123132, 'd': 1032908, 'wd': 667656, 'm.m': 468812, 's': 213008, 'm.s': 190193, 'm.q': 111030, 'b': 100118, 'q': 95519, 'm.b': 58849, 'n': 51576, 'voy': 36258, 'm.voy': 25983, 'w': 25506, 'v': 25045, 'm.v': 12886, 'm.n': 9851, 'y': 92, 'zero': 1})
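For reference, a tally like the one above could be produced roughly as follows. This is only a sketch: it assumes the project code is everything after the first dot of the first field, that comment lines start with #, and the function names are made up for illustration. A project field without any dot would be counted under an empty code.

```python
import bz2
from collections import Counter


def count_project_codes(lines):
    """Tally project codes over an iterable of dump lines (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        # Skip the comment header and any blank lines.
        if line.startswith("#") or not line.strip():
            continue
        project = line.split(" ", 1)[0]
        # Assumption: 'en.m.d' means language 'en' and project code 'm.d'.
        _, _, project_code = project.partition(".")
        counts[project_code] += 1
    return counts


def count_project_codes_in_dump(path):
    """Stream a .bz2 dump line by line without decompressing it to disk first."""
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_project_codes(handle)
```

Calling count_project_codes_in_dump("pagecounts-2019-01-01.bz2") on a locally downloaded copy would then return a Counter keyed by project code.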
I imagine that the m. prefix stands for mobile, that voy stands for wikivoyage, and that zero can perhaps be ignored, but I still do not know what these stand for: m, d, wd, w, y. This is the crux of my question. I suspect that I would encounter other project codes in other files, but that is outside the scope of this question.