crawler

This crawler requires a LAMP installation.
To run the crawler, first create a MySQL database and add its credentials to functions.php.
A database export is included in crawler.sql and can be imported directly into MySQL.
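
As a rough sketch of the database step, the commands below wrap the standard `mysql` command-line client in a small function. The database name `crawler` and the user `crawler_user` are placeholders, not names taken from this project, and the sketch assumes crawler.sql does not create the database itself.

```shell
# Placeholder names (assumptions): use the same values you put in functions.php.
DB_NAME="crawler"
DB_USER="crawler_user"

setup_db() {
  # Create the database if it does not exist yet (-p prompts for the password).
  mysql -u "$DB_USER" -p -e "CREATE DATABASE IF NOT EXISTS $DB_NAME"
  # Import the export shipped with the crawler into that database.
  mysql -u "$DB_USER" -p "$DB_NAME" < crawler.sql
}
```

Call `setup_db` from the directory that contains crawler.sql.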
Decide which node will be the master (preferably the first one on which the code is downloaded) and set the variable $master in Crawler.php to the IP address of that node.
The crawler can be limited to a local network, in which case the master IP will be private, or run across the internet, in which case it will be public.
Run the crawler by executing the command:

php /path/to/Crawler.php

Notes: Other settings can be configured; all of them are located in the first 13 lines of Crawler.php. Read the comments there.

When the master node runs for the first time, it clears the URL frontier table and fills it with the URLs in the seed.txt file.

After that first run, the master node renames seed.txt to seed.old so that, when the crawl job stops and is started again, it won't load the same URLs a second time.
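
The seed handling described above can be illustrated with a small shell sketch. This is not the code in Crawler.php: the function name `load_seeds` is hypothetical, and a plain file stands in for the URL frontier table.

```shell
# Illustration only: a plain file stands in for the URL frontier table.
load_seeds() {
  seed_file="$1"
  frontier_file="$2"
  # If the seed file was already consumed (renamed), leave the frontier alone.
  [ -f "$seed_file" ] || return 0
  # First run: clear the frontier and refill it from the seed list.
  cp "$seed_file" "$frontier_file"
  # Rename seed.txt to seed.old so a later run does not reload the same URLs.
  mv "$seed_file" "${seed_file%.txt}.old"
}
```

Calling `load_seeds seed.txt frontier.txt` twice loads the seeds only once, because the second call no longer finds seed.txt.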
