In progress

Break up Wikipedia pages into articles

PRELIMINARY NOTE: this project requires parsing content from Wikipedia. Wikipedia is licensed under CC, so this is not only perfectly legal but encouraged. Wikipedia has a page advising users on how to do exactly this. We won't be scraping the website; we will use a downloadable version that Wikipedia itself provides.

---

We need to break up every page on Wikipedia into multiple articles.

For instance, this article: [url removed, login to view] is already divided into:

Contents

1 Etymology

2 History

2.1 Prehistory

2.2 Bronze Age

2.3 Iron Age

2.4 Migration period

2.5 Viking Age

2.6 Kalmar Union

2.7 Union with Denmark

2.8 Union with Sweden

2.9 Dissolution of the union

2.10 First and Second World Wars

2.11 Post-World War II history

3 Geography

3.1 Climate

3.2 Biodiversity

3.3 Environment

4 Politics and government

4.1 Administrative divisions

4.2 Judicial system and law enforcement

4.3 Foreign relations

4.4 Military

5 Health

6 Economy

6.1 Resources

6.1.1 Oil fields

6.2 Transport

7 Demographics

7.1 Migration

7.1.1 Emigration

7.1.2 Immigration

7.2 Religion

7.3 Largest cities of Norway

7.4 Education

7.5 Languages

8 Culture

8.1 Human rights

8.2 Religion

8.3 Cinema

8.4 Music

8.5 Literature

8.6 Research

8.7 Architecture

8.8 Art

8.9 Cuisine

8.10 Sports

9 International rankings

10 See also

11 Notes

12 References

13 Bibliography

On Wikipedia, these links point to a section of the same page. Instead, we need each section of the page to become a standalone article, so that we can import it as a module.
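As a rough illustration of the splitting step, here is a minimal Python sketch. It assumes the page is available as an HTML string and that sections start at h2 to h4 heading tags, which may not match the markup actually stored in the dump. The function name split_sections and the "Introduction" label are hypothetical.

```python
import re

# Matches a complete heading of level 2 to 4, e.g. <h2 id="History">History</h2>.
HEADING_RE = re.compile(r"<h([2-4])[^>]*>(.*?)</h\1>", re.IGNORECASE | re.DOTALL)
# Used to strip inline tags (such as <span>) from the heading text.
TAG_RE = re.compile(r"<[^>]+>")


def split_sections(html):
    """Return a list of {"level", "title", "html"} dicts, introduction first."""
    matches = list(HEADING_RE.finditer(html))
    # Everything before the first heading is treated as the introduction.
    intro_end = matches[0].start() if matches else len(html)
    sections = [{"level": 1, "title": "Introduction", "html": html[:intro_end].strip()}]
    for i, match in enumerate(matches):
        # A section body runs until the next heading of any level (or the end).
        body_end = matches[i + 1].start() if i + 1 < len(matches) else len(html)
        sections.append({
            "level": int(match.group(1)),
            "title": TAG_RE.sub("", match.group(2)).strip(),
            "html": html[match.end():body_end].strip(),
        })
    return sections
```

With this approach a parent section such as "History" only holds the text that precedes its first subsection; each subsection becomes its own article. A regular expression is a shortcut here, and a real HTML parser would probably be more robust on irregular markup.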

For each page processed, we need to generate a JSON file or database entry containing metadata such as the page title and the category the page was filed under, plus an array of the articles generated from it (including the article introduction, which is not listed under "Contents").
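One possible shape for that per-page record, sketched in Python. The field names ("title", "category", "articles", "position", "html") are assumptions made for illustration, not a format specified in this posting.

```python
import json


def build_page_record(page_title, category, articles):
    """Build the metadata record for one page.

    articles is a list of (title, html) tuples with the introduction first.
    """
    return {
        "title": page_title,
        "category": category,
        "articles": [
            {"position": i, "title": title, "html": html}
            for i, (title, html) in enumerate(articles)
        ],
    }


def to_json(record):
    # ensure_ascii=False keeps accented Italian characters readable in the output.
    return json.dumps(record, ensure_ascii=False, indent=2)
```

The same dictionary could be stored as a database row instead of a file; only the to_json step would change.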

If we opt for JSON files, each page could have a folder with its articles saved as individual HTML files (for instance, "1 [url removed, login to view]", "2 [url removed, login to view]", "[url removed, login to view]" for the introduction).
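A sketch of how such a folder might be written, in Python. The example file names in this posting are not visible, so the naming scheme used here ("introduction.html", then the Contents number followed by the section title) is purely an assumption, as are the function names.

```python
import os
import re


def toc_numbers(levels):
    """Turn heading levels (2 = top level) into Contents numbers like '2.1'."""
    counters, numbers = [], []
    for level in levels:
        depth = level - 1              # h2 -> depth 1, h3 -> depth 2, ...
        counters = counters[:depth]    # drop counters of deeper levels
        while len(counters) < depth:   # pad with zeros if a level was skipped
            counters.append(0)
        counters[-1] += 1
        numbers.append(".".join(str(c) for c in counters))
    return numbers


def safe_name(text):
    # Keep letters, digits, underscores, hyphens and spaces; drop the rest.
    return re.sub(r"[^\w\- ]+", "", text).strip() or "untitled"


def write_articles(dest, sections):
    """Write one HTML file per section and return the file names.

    sections is a list of (level, title, html); level None marks the introduction.
    """
    os.makedirs(dest, exist_ok=True)
    numbers = iter(toc_numbers([lvl for lvl, _, _ in sections if lvl is not None]))
    names = []
    for level, title, html in sections:
        if level is None:
            name = "introduction.html"  # hypothetical name for the introduction
        else:
            name = f"{next(numbers)} {safe_name(title)}.html"
        with open(os.path.join(dest, name), "w", encoding="utf-8") as handle:
            handle.write(html)
        names.append(name)
    return names
```

Deriving the number from the heading levels keeps the file names aligned with the numbering shown under "Contents", without relying on the table of contents markup itself.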

We also need to generate a JSON file with the tree of all categories on Wikipedia.
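For the category tree, a minimal Python sketch follows. It assumes the category relationships have already been extracted as (child, parent) pairs and that a single root category is known; how to obtain those pairs from the dump is not covered here. Wikipedia categories can have several parents and can form loops, so the sketch skips any category that already appears on the current path. All names are hypothetical.

```python
import json
from collections import defaultdict


def build_category_tree(pairs, root):
    """Build a nested {"name", "children"} tree from (child, parent) pairs."""
    children = defaultdict(list)
    for child, parent in pairs:
        children[parent].append(child)

    def node(name, ancestors):
        path = ancestors | {name}
        # Skip categories already on the path so that cycles cannot recurse forever.
        kids = [node(c, path) for c in sorted(children[name]) if c not in path]
        return {"name": name, "children": kids}

    return node(root, frozenset())


def tree_to_json(tree):
    return json.dumps(tree, ensure_ascii=False, indent=2)
```

A category with two parents appears once under each of them. For a hierarchy as deep as a full Wikipedia dump, the recursion might need to be replaced by an explicit stack.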

Because Wikipedia is CC-licensed, anyone can download it.

The tool will need to parse the ZIM file containing all articles. We will be using the Italian version (downloadable here: [url removed, login to view], file [url removed, login to view]). While the locale shouldn't matter, we will ultimately need to populate Imparato with content from the Italian version.

The software should ideally run from the command line on Unix systems, something like:

zim-extract-categories --zim-file [url removed, login to view] --dest .

zim-extract-articles --zim-file [url removed, login to view] --dest . --category 22
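The command-line interface above could be parsed along these lines. This is only a Python sketch: the flag names come from the examples, while treating --category as a numeric ID, defaulting --dest to the current directory, and the function name build_parser are assumptions.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        prog="zim-extract-articles",
        description="Split the pages of a ZIM file into standalone articles.",
    )
    parser.add_argument("--zim-file", required=True, help="path to the ZIM file")
    parser.add_argument("--dest", default=".", help="output directory")
    # Assumption: categories are addressed by a numeric ID, as in "--category 22".
    parser.add_argument("--category", type=int, help="only extract this category")
    return parser
```

Calling build_parser().parse_args() yields an object with zim_file, dest and category attributes; zim-extract-categories could reuse the same parser without the --category option.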

Skills: PHP, UNIX, WIKI, Wikipedia

About the employer:
( 0 reviews ) Italy

Project ID: #15237610

Awarded to:

GeorgeGorbunov

Parsing WIKI. Relevant Skills and Experience: PHP/JavaScript. Proposed Milestones: €30 EUR - Parsing

€30 EUR in 1 day
(0 reviews)
0.0

4 freelancers are bidding an average of €52 for this job

arifrizvi

I am an experienced Wikipedia editor and I have published more than 100 pages. I have created pages for clients on [url removed, login to view] as well. You can visit my profile and have a look at their reviews. Relevant Skills and More

€150 EUR in 1 day
(3 reviews)
3.1
powerevolution

Hello, my name is Antonio and I'm a native Italian translator, copywriter and e-commerce specialist. Relevant Skills and Experience: Check my reviews to learn more about me and don't hesitate to contact me for any More

€8 EUR in 1 day
(1 review)
1.0
Christa7

A proposal has not yet been provided

€19 EUR in 1 day
(0 reviews)
0.0