I’m almost certain something like this has been done before. Anyway, here’s the idea:
- Download the Wikipedia database dump.
- Ingest article texts into a database.
- Scrape Wikipedia links out of the first paragraph of each article.
- Create a directed graph of articles, with an edge from one article to another if the first links to the second in its first paragraph (as scraped in the previous step). Treat article categories as node attributes.
- Investigate the community structure of Wikipedia articles, particularly which categories cluster together.
- Extra challenge: Try to find articles that won’t “get you to philosophy”.
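
The link-scraping, graph-building, and "get you to philosophy" steps above could be sketched roughly as follows. This is only a minimal illustration under some assumptions: article text is raw wikitext held in a plain dict, paragraphs are separated by blank lines, and the function names (`first_paragraph_links`, `build_graph`, `reaches`) and the toy articles are made up for the example. It also ignores the usual refinements of the philosophy game, such as skipping links in parentheses or italics, and it does not handle templates or infoboxes that precede the first paragraph in real dumps.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section]]; captures only Target.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def first_paragraph_links(wikitext):
    """Return the link targets found in the first non-empty paragraph, in order."""
    for para in wikitext.split("\n\n"):
        if para.strip():
            links = []
            for match in WIKILINK.finditer(para):
                target = match.group(1).strip()
                # Skip namespace links such as File: or Category:, and duplicates.
                if ":" not in target and target not in links:
                    links.append(target)
            return links
    return []


def build_graph(articles):
    """Map each title to the titles linked from its first paragraph (directed edges)."""
    return {title: first_paragraph_links(text) for title, text in articles.items()}


def reaches(graph, start, goal="Philosophy"):
    """Follow the first link of each article until the goal, a dead end, or a loop."""
    seen = set()
    current = start
    while current not in seen:
        if current == goal:
            return True
        seen.add(current)
        links = graph.get(current, [])
        if not links:
            return False
        current = links[0]
    return False


# Toy data, invented purely for illustration.
toy_articles = {
    "A": "A is a kind of [[B|bee]] and [[C]].\n\nLater text links to [[D]].",
    "B": "B relates to [[Philosophy#Branches]].",
    "C": "C points back to [[A]].",
    "Philosophy": "No links here.",
    "Loop1": "See [[Loop2]].",
    "Loop2": "See [[Loop1]].",
}
toy_graph = build_graph(toy_articles)
stuck = [title for title in toy_graph if not reaches(toy_graph, title)]
```

Here `stuck` collects the articles whose first-link chain never arrives at "Philosophy". For the community-structure step, the same adjacency dict could be handed to a graph library, but I have not settled on one.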
There are currently over 4M articles in the English Wikipedia, so for this to be feasible I will probably need to invent some criterion for including articles in the project, such as minimum length, minimum age, or minimum number of edits. Alternatively, I might just focus on certain categories/subcategories.
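
An inclusion criterion like this might look something like the sketch below. Everything here is an assumption for illustration: the `ArticleMeta` record, the `include` function, and especially the threshold values, which are placeholders rather than numbers I have tested against the dump. The thresholds are applied together, and setting one to 0 switches it off.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ArticleMeta:
    """Hypothetical per-article metadata extracted from the dump."""
    title: str
    text_length: int          # characters of article text
    created: date             # date of the first revision
    edit_count: int           # number of revisions
    categories: frozenset = frozenset()


def include(meta, today, min_length=2000, min_age_days=365, min_edits=20,
            allowed_categories=None):
    """Decide whether an article is kept in the project.

    If allowed_categories is given, keep only articles in at least one of
    those categories; otherwise apply the length, age, and edit thresholds.
    """
    if allowed_categories is not None:
        return bool(meta.categories & allowed_categories)
    age_days = (today - meta.created).days
    return (meta.text_length >= min_length
            and age_days >= min_age_days
            and meta.edit_count >= min_edits)


# Made-up examples.
today = date(2014, 1, 1)
old_long = ArticleMeta("Old", 5000, date(2005, 1, 1), 300, frozenset({"Physics"}))
new_stub = ArticleMeta("New", 300, date(2013, 12, 1), 2, frozenset({"Music"}))
kept = [m.title for m in (old_long, new_stub) if include(m, today)]
```

Filtering by category instead would just mean passing something like `allowed_categories=frozenset({"Music"})`, which bypasses the numeric thresholds in this sketch.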