Looking at https://explore2.marginalia.nu/search?domain=simonwillison.n... now that's an interesting service. The web has felt isolating since it became commercialized. Bloggers are living in the Google dark ages right now. Having information like this be readily accessible could help us find each other and get the band back together. The open web can be reborn.
Yeah, this is basically what I've been trying to show people for the last few years. There's still so much wild Internet out there if you go looking for it.
The current incarnation of diveintomark.org really doesn't belong in that list. The original went offline more than a decade ago.
Yes, explore2 is just a demo, an unfiltered listing of the output of this algorithm. For better or worse, it has no concept of dead links. If and when I productize it, it will need to hook into the search engine's link database better.
BTW, if anyone wants to dabble in this problem space, I make, among other things, the entire link graph available here: https://downloads.marginalia.nu/exports/
(hold my beer as I DDOS my own website by offering multi-gigabyte downloads on the HN front page ;-)
Why not offer it only via BitTorrent?
Well I mean I could, but it's easier and more convenient to just put them in a directory on the server than go through all the rigmarole of creating a torrent.
Hey, that's an interesting idea for a side project: an easy way to serve a file via torrent. Maybe a commandline utility that uses WebSeeding so you don't need to run a torrent client? Hmm.
(I did a quick `apt search` to see if something like that is already available, but didn't find anything.)
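If anyone wants to tinker with that idea, here's a rough sketch of how it could look in Python with the third-party torf library (hypothetical tooling, nothing Marginalia actually uses; the tracker and paths are just examples):

```python
# Hypothetical "serve a file as a torrent with a web seed" helper, using the
# third-party torf library. Tracker URL and file paths are only examples.
import sys

from torf import Torrent

path, url = sys.argv[1], sys.argv[2]  # local file + the public HTTP(S) URL it's served from

t = Torrent(
    path=path,
    trackers=["udp://tracker.opentrackr.org:1337/announce"],
    webseeds=[url],  # BEP 19 web seed: clients can also fetch pieces over plain HTTP
)
t.generate()                # hash the pieces
t.write(path + ".torrent")  # hand this .torrent out; the web server keeps "seeding" via HTTP
print("wrote", path + ".torrent")
```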
Troy Hunt had the same story a while ago, might be helpful - https://www.troyhunt.com/how-i-got-pwned-by-my-cloud-costs/
I run outside of the cloud and have fixed costs so in that sense it's no prob :-)
In 2012 I was trying to turn my PhD thesis into a product: a better guitar tab and song lyrics search engine. The method was precisely this: use cosine similarity on the content itself (musical instructions parsed from the tabs, or the tokens of the lyrics).
This way I not only got much better search results than with PageRank; a further byproduct of the approach was that you could cluster the results and draw each subsequent result from a distinct cluster. With Google you wouldn't just get a bad result at number 1: results 1-20 would be near-duplicates of just a few distinct efforts.
Unfortunately I was a terrible software engineer back then and had much to learn about making a product.
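For the curious, here's a minimal modern sketch of that content-based similarity plus clustering idea, using scikit-learn (not my 2012 code; the toy documents are made up):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Am G F E classic flamenco progression tab",
    "Am G F E another transcription of the same progression",
    "C G Am F four chord pop song lyrics and chords",
]

# Similarity comes from the content itself, not from who links to what.
tfidf = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(tfidf)

# Cluster near-duplicates so each result slot can be drawn from a different cluster.
labels = AgglomerativeClustering(
    n_clusters=2, metric="precomputed", linkage="average"
).fit_predict(1 - sim)

print(sim.round(2))
print(labels)  # e.g. [0 0 1]: the two near-duplicate tabs end up in the same cluster
```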
The author describes calculating cosine similarity of high dimensional vectors. If these are sparse binary vectors why not just store a list of nonzero indexes instead? That way your “similarity” is just the length of the intersection of the two sets of indexes. Maybe I’m missing something.
Yes, that's how it's done.
A dense (bitmap) representation of the matrix wouldn't fit in memory; it would require about a PB of RAM unless my napkin math is off (10^8 × 10^8 bits is on the order of a petabyte). The cardinality of this dataset is in the 100s of millions.
(An additional detail: I'm actually using a tiny fixed-width bloom filter to make it go even faster.)
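For reference, the set-based formulation the parent describes boils down to this (for binary vectors the cosine reduces to |A ∩ B| / √(|A|·|B|)); the bloom filter would presumably act as a cheap pre-check before the exact intersection:

```python
def set_cosine(a: set[int], b: set[int]) -> float:
    """Cosine similarity of two sparse binary vectors stored as sets of nonzero indexes."""
    if not a or not b:
        return 0.0
    return len(a & b) / (len(a) * len(b)) ** 0.5

# Made-up example: IDs of the domains linking to two sites.
links_to_a = {3, 17, 42, 99, 128}
links_to_b = {17, 42, 77, 128}
print(set_cosine(links_to_a, links_to_b))  # 3 / sqrt(5 * 4) ≈ 0.67
```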
I'm curious whether you had a look at GraphBLAS (à la RedisGraph, etc.)?
Nope. I did try a few off-the-shelf approaches, but ultimately figured it was easier to build something myself.
You seem to be missing that that's already how it's implemented.
I love the random page: https://search.marginalia.nu/explore/random
This makes me feel like I'm back on the old open web again.
I am surprised nobody has thought about looking at page content itself to help fight spam. If a blog has nothing except paid affiliate links (Amazon, etc.), ads, and popups after page load (newsletter signups, etc.), then it should probably be down-ranked.
I have actually been developing something like that, but it does more, including down-ranking certain categories of sites that contain unnecessary filler, such as some recipe sites.
Result ranking takes a lot of variables into account, and factors like excessive tracking and affiliate links are among them in my search engine.
You can poke around in the result valuation code here: https://github.com/MarginaliaSearch/MarginaliaSearch/blob/ma...
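As a heavily simplified illustration of that kind of penalty-based valuation (made-up hints and weights, not the actual code linked above, which considers many more signals):

```python
# Purely illustrative penalty scoring: excessive tracking and affiliate links
# push a result down. Hints and weights are invented for this sketch.
TRACKER_HINTS = ("googletagmanager.com", "doubleclick.net", "facebook.net")
AFFILIATE_HINTS = ("amzn.to", "tag=", "/affiliate/")

def spam_penalty(html: str, outbound_links: list[str]) -> float:
    penalty = 0.0
    penalty += 1.5 * sum(html.count(h) for h in TRACKER_HINTS)      # tracking scripts
    penalty += 2.0 * sum(any(h in url for h in AFFILIATE_HINTS)
                         for url in outbound_links)                 # affiliate links
    return penalty  # a higher penalty means a lower ranking

print(spam_penalty('<script src="https://www.googletagmanager.com/gtag.js"></script>',
                   ["https://amzn.to/xyz", "https://example.com/post"]))  # 3.5
```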
Kagi does this (it is one of the main ranking signals).
It gives plausible results for websites similar to HN.
https://explore2.marginalia.nu/search?domain=news.ycombinato...
Aww, sadly nothing for my own websites!
This is such a great idea! Often when I find a small blog or site I want more like it, and this is the perfect tool for discovering that. It's a clear and straightforward idea in retrospect, as all really great ideas tend to be!
It's not even really particularly new; Technorati did this stuff 20 years ago.
What are the details? The sample page just 404s
I migrated the blog over to Hugo a while back and I think I lost the data. But no worries, the Wayback Machine's got a snapshot:
https://web.archive.org/web/20230217165734/https://www.margi...
There are details in the first linked "See Also" post: https://www.marginalia.nu/log/69-creepy-website-similarity/
Discussion for the similarity post: https://news.ycombinator.com/item?id=34143101
Ah, thanks!
cosine similarity approach is better than PageRank
It's still fundamentally PageRank though, it just gets fed website similarities instead of links.
Not my opinion, just a summary for the parent who couldn't load the page.
The whole post could be this line!
To be fair, they are two different metrics. PageRank measures how authoritative a page is; the cosine metric measures how similar a page is to another one.
Concludes that a certain www.example.com is 42% similar to example.com, when they are exactly the same: one redirects to the other.
The only thing different is the domain names, and those character strings themselves are more than 42% similar.
It's not comparing domain names or even content, but the similarity of the sets of domains that link to the given pair of sites. That's sort of the neat part: how well this property correlates with topic similarity.
OK, so certain domains X link to www.whatever.com. Certain domains Y link to whatever.com. The similarity between those is 42% in some metric (like what they link to?)
Yes, it's the cosine similarity between those sets: if, say, 100 domains link to each site and 42 of those domains overlap, you get 42/√(100·100) = 0.42.
I have searched the GitHub repo for information on page ranking.
I am a newbie at SEO. I would greatly appreciate it if Marginalia provided a clean README about its ranking algorithm.
On the Marginalia Search front page we have access to search keywords; the ranking algorithm is important enough to at least be discussed in layman's terms.
How do you optimize a page so that it ranks highly?
I understand this could be in the code documentation, but I have not yet checked it, sorry.
The algorithm does not exist to be manipulated in that way. In fact, the article ends with "This new approach seems remarkably resistant to existing pagerank manipulation techniques". It is my opinion (and I know some people will disagree with me) that SEO is harmful and should not exist. Since you're still new to the industry, it might be worthwhile pivoting to a different occupation.
Hi, I am not that familiar with page ranking and its terminology. I think my original question may have been misunderstood.
I am writing my own web scraper; that is why I am interested in this topic at all.
To distinguish poor pages from better ones, I check the HTML. I think all scrapers need to do that. I rank pages higher if they contain valid titles, og: fields, etc.
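Roughly this kind of check, as a simplified sketch using BeautifulSoup:

```python
from bs4 import BeautifulSoup

def page_quality(html: str) -> int:
    """Crude quality score: reward a real <title>, og: metadata and a description."""
    soup = BeautifulSoup(html, "html.parser")
    score = 0
    if soup.title and soup.title.string and soup.title.string.strip():
        score += 1  # non-empty <title>
    og_tags = soup.find_all("meta", property=lambda p: p and p.startswith("og:"))
    score += min(len(og_tags), 3)  # reward og: fields, but cap the bonus
    if soup.find("meta", attrs={"name": "description"}):
        score += 1
    return score

print(page_quality('<title>Hello</title><meta property="og:title" content="Hello">'))  # 2
```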
There is nothing wrong with checking that, or with asking what I can do to make my site more scrape-friendly.
Thanks,
If you think about it, a lot of people work in "manipulation techniques"
The way the ranking algorithm works is by comparing the similarity of the inbound links between websites.
So to manipulate the algorithm, you'd need to find an important website and then find a way to change all the websites that link to it so that they also link to your own website.
Not sure if serious? From the post:
> This new approach seems remarkably resistant to existing pagerank manipulation techniques
Surprisingly effective for discovering new Mastodon or Pleroma instances!
The "Explore sample data" link 404s.
Goodhart's law: "When a measure becomes a target, it ceases to be a good measure."
This could be helpful in the short-term, but I'm skeptical long-term as it'll become just as gamed.
Isn't that Google's original algorithm?
PageRank is. This is a modification of PageRank: the original algorithm calculates the principal eigenvector of the link graph.
This algorithm uses the same method to calculate an eigenvector, but in an embedding space based on the similarity of the link graph's incidence vectors.
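As a toy illustration, the eigenvector calculation is the same power-iteration trick either way; what changes is the matrix it runs over (a tiny made-up similarity matrix here, nothing like the real scale):

```python
import numpy as np

def eigenvector_rank(S: np.ndarray, damping: float = 0.85, iters: int = 50) -> np.ndarray:
    """Rank nodes via power iteration, the same machinery PageRank applies to the link matrix."""
    n = S.shape[0]
    # Column-normalize so each node distributes its weight across its neighbours.
    M = S / np.maximum(S.sum(axis=0, keepdims=True), 1e-12)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - damping) / n + damping * (M @ r)
    return r

# Three imaginary sites; the first two are strongly similar to each other.
S = np.array([[0.0, 0.8, 0.1],
              [0.8, 0.0, 0.2],
              [0.1, 0.2, 0.0]])
print(eigenvector_rank(S).round(3))
```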
Hmmm, odd. I was under the impression they used cosine similarity based on page content. Once upon a time, based on that 'memory', I created a system to bin domain names into categories using cosine similarity. It worked surprisingly well.
Regardless, well done!
Hmm, seems like something that might be used for deduplication maybe?