Crawler reliability issues #654

Open · hiiamboris opened this issue Aug 16, 2024 · 0 comments
Labels: bug, crawler

Comments

@hiiamboris

Hi!

I'm trying to index all pages matching the mask https://github.com/red/red/issues/\d+. That should be 4002 pages.
How hard can it be?
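For clarity, this is roughly what I expect that mask to match and not match (a quick Python sketch, nothing YaCy-specific; the sample URLs are made up for illustration):

```python
import re

# The pages I want indexed: individual issue pages only (dots escaped for Python).
ISSUE_MASK = re.compile(r"https://github\.com/red/red/issues/\d+")

# Illustrative URLs only, not taken from any real crawl log.
samples = [
    "https://github.com/red/red/issues/1",                    # issue page: should match
    "https://github.com/red/red/issues/4002",                 # issue page: should match
    "https://github.com/red/red/issues?page=2&q=is%3Aissue",  # listing page: should not match
    "https://github.com/red/red/pull/123",                    # pull request: should not match
]

for url in samples:
    print(f"{'match   ' if ISSUE_MASK.fullmatch(url) else 'no match'}  {url}")
```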

What I've tried:

  1. Running the advanced crawler with a depth limit of 99 (it doesn't allow more), hoping that it would jump from page to page until it loaded all the pages.
    Start page: https://github.com/red/red/issues?page=1&q=is%3Aissue
    Filter: https://github.com/red/REP/issues(\?page=\d+&q=is%3Aissue|/\d+)? (to exclude the irrelevant pages)
    In the end it indexes an arbitrary number of pages, anywhere from 500 to 2000.
  2. Running the same from a few starting points at once:
    Start-1: https://github.com/red/red/issues?page=1&q=is%3Aissue
    Start-2: https://github.com/red/red/issues?page=75&q=is%3Aissue
    Start-3: https://github.com/red/red/issues?page=150&q=is%3Aissue
    Same result, maybe in the range 1000-2000.
    The problem seems to be that GitHub sometimes fails to return a page, and then the whole crawl sequence breaks. Re-loading the failed pages obviously won't let me crawl them (so whole pages of issues are lost).
  3. Listing all links of the form https://github.com/red/red/issues/\d+ from 1 to 5535 (see the sketch after this list) and running the advanced crawler on that link list with depth=0. This must be bulletproof, right? Wrong...
    As a result I get only around 100-200 pages. Most of the pages show a "link, detected from context" error in the Index Browser (whatever that means!?), even though these are valid issue webpages.
    Some (<30%) of the links do redirect to the /pulls and /discussions subpaths, but they get indexed anyway, despite my efforts to exclude them by restricting the filter to the subpath or to the https://github.com/red/red/issues/\d+ mask. The filter just gets ignored.
    Typical log example: indexing-log-github.pdf
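For reference, generating the flat link list for item 3 is trivial; something like this rough Python sketch does it (the output filename is arbitrary, and 1-5535 is simply the highest number I saw):

```python
# Writes the flat list of issue URLs fed to the advanced crawler with depth=0.
# Note: numbers that belong to pull requests or discussions will redirect (see item 3).
with open("red-issue-links.txt", "w") as out:  # filename is arbitrary
    for n in range(1, 5536):                   # 1..5535 inclusive
        out.write(f"https://github.com/red/red/issues/{n}\n")
```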

Another issue I've noticed is that whenever I start a new crawl, the previously crawled pages (in the same subpath) gradually start to disappear from the index, even though all settings are set to "don't delete anything".

I would appreciate fixes for these issues, as well as instructions on what I can do to work around them.

Using a Docker install with default settings, in Robinson mode. Version: yacy_v1.940_202407241507_d181b9e89

@okybaca added the bug and crawler labels on Aug 21, 2024