Mitigate URL shorteners #53

Open
chrisnewtn opened this issue Oct 14, 2013 · 0 comments

So Twitter have started using their t.co URL shortener on their profile links. At the moment Elsewhere is oblivious to this and to any other shortener; it just treats the shortened URL as if it were the actual URL. This is a problem.

What it basically means is that your website, instead of being example.com, is identified as t.co/24rkwdfj. When Elsewhere validates links, it can't find any link to example.com; it can only see t.co/24rkwdfj, and since nowhere else links to that, it won't treat it as a valid resource.

In order to mitigate this we need to make Elsewhere aware of the fact that it's resolving redirects (I'm not exactly sure how they're handled at the moment).

The solution proposed is that we identify sites by their actual url i.e. the url that the shortener resolves to. We will still however keep track of the urls that are used as part of any redirects to the resolved url.
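As a rough illustration of the idea, here is a minimal sketch in Python (not Elsewhere's actual code). The names `resolve_url`, `next_hop` and `max_hops` are hypothetical. `next_hop` stands in for whatever performs the HTTP request and returns the redirect target, or `None` when the URL does not redirect; here it is faked with a dictionary so the example runs without a network.

```python
from typing import Callable, List, Optional, Tuple


def resolve_url(
    url: str,
    next_hop: Callable[[str], Optional[str]],
    max_hops: int = 10,
) -> Tuple[str, List[str]]:
    """Follow redirects from `url` and return (resolved_url, aliases).

    `aliases` lists every URL passed through on the way to the
    resolved one, in the order they were encountered.
    """
    aliases: List[str] = []
    current = url
    seen = {url}
    for _ in range(max_hops):
        target = next_hop(current)
        # Stop when there is no further redirect, or on a redirect loop.
        if target is None or target in seen:
            break
        aliases.append(current)
        seen.add(target)
        current = target
    return current, aliases


# Hypothetical redirect table standing in for real HTTP lookups.
redirects = {"http://t.co/vV5BWNxil2": "http://chrisnewtn.com"}

resolved, aliases = resolve_url("http://t.co/vV5BWNxil2", redirects.get)
```

With this, the site is identified by `resolved`, while `aliases` keeps the shortened URL that led to it.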

The end result, aside from the fixed URLs, is a slight modification to each resource returned in the response.

{
  "results": [
    {
      "url": "http://chrisnewtn.com",
      "title": "Chris Newton",
      "favicon": "http://chrisnewtn.com/favicon.ico",
      "outboundLinks": {
          "verified": [ ... ],
          "unverified": [ ]
      },
      "inboundCount": {
        "verified": 4,
        "unverified": 0
      },
      "verified": true,
      // new bit
      "urlAliases": [
        "http://t.co/vV5BWNxil2"
      ]
    }
  ],
  "query": "http://chrisnewtn.com",
  "created": "2012-10-12T16:30:57.270Z",
  "crawled": 9,
  "verified": 9
}

The urlAliases property contains all the other URLs that Elsewhere has encountered for the resource, just in case they're useful.
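One possible way to collect those aliases, sketched in Python under the assumption that resources are stored in a dictionary keyed by resolved URL. The `add_aliases` helper and the `resources` store are hypothetical and only illustrate the shape of the data.

```python
from typing import Dict, List


def add_aliases(resources: Dict[str, dict], resolved_url: str, aliases: List[str]) -> dict:
    """Record `aliases` on the resource identified by `resolved_url`.

    Creates the resource if it has not been seen yet, and skips
    duplicates so each alias is listed only once.
    """
    resource = resources.setdefault(
        resolved_url, {"url": resolved_url, "urlAliases": []}
    )
    for alias in aliases:
        if alias != resolved_url and alias not in resource["urlAliases"]:
            resource["urlAliases"].append(alias)
    return resource


# Hypothetical store of crawled resources, keyed by resolved URL.
resources: Dict[str, dict] = {}
add_aliases(resources, "http://chrisnewtn.com", ["http://t.co/vV5BWNxil2"])
# Seeing the same shortened link again does not add a duplicate entry.
add_aliases(resources, "http://chrisnewtn.com", ["http://t.co/vV5BWNxil2"])
```

Keying by the resolved URL means a link found via t.co and a direct link both count towards the same resource.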
