Looking at https://explore2.marginalia.nu/search?domain=simonwillison.n... now that's an interesting service. The web has felt isolating since it became commercialized. Bloggers are living in the Google dark ages right now. Having information like this be readily accessible could help us find each other and get the band back together. The open web can be reborn.
Yeah, this is basically what I've been trying to show people for the last few years. There's still so much wild Internet out there if you go looking for it.
The current incarnation of diveintomark.org really doesn't belong in that list. The original went offline more than a decade ago.
Yes, explore2 is just a demo, an unfiltered listing of the output of this algorithm. For better or worse, it has no concept of dead links. If and when I productize it, it will need to hook into the search engine's link database better.
BTW, if anyone wants to dabble in this problem space, I make, among other things, the entire link graph available here: https://downloads.marginalia.nu/exports/
(hold my beer as I DDOS my own website by offering multi-gigabyte downloads on the HN front page ;-)
Why not offer it only via BitTorrent?
Well I mean I could, but it's easier and more convenient to just put them in a directory on the server than go through all the rigmarole of creating a torrent.
Hey, that's an interesting idea for a side project: an easy way to serve a file via torrent. Maybe a commandline utility that uses WebSeeding so you don't need to run a torrent client? Hmm.
(I did a quick `apt search` to see if something like that is already available, but didn't find anything.)
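If anyone wants to tinker with that idea, here's a rough sketch of how it could look in Python with the third-party torf library (hypothetical tooling, nothing Marginalia actually uses; the tracker and paths are just examples):

```python
# Hypothetical "serve a file as a torrent with a web seed" helper, using the
# third-party torf library. Tracker URL and file paths are only examples.
import sys

from torf import Torrent

path, url = sys.argv[1], sys.argv[2]  # local file + the public HTTP(S) URL it's served from

t = Torrent(
    path=path,
    trackers=["udp://tracker.opentrackr.org:1337/announce"],
    webseeds=[url],  # BEP 19 web seed: clients can also fetch pieces over plain HTTP
)
t.generate()                # hash the pieces
t.write(path + ".torrent")  # hand this .torrent out; the web server keeps "seeding" via HTTP
print("wrote", path + ".torrent")
```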
Troy Hunt had the same story a while ago, might be helpful - https://www.troyhunt.com/how-i-got-pwned-by-my-cloud-costs/
I run outside of the cloud and have fixed costs so in that sense it's no prob :-)
In 2012 I was trying to turn my PhD thesis into a product: a better guitar tab and song lyrics search engine. The method was precisely this: use cosine similarity on the content itself (musical instructions parsed from the tabs, or the tokens of the lyrics).
This way I not only got much better search results than with PageRank; a further byproduct of the approach was that you could cluster the results and draw each subsequent result from a distinct cluster. With Google you wouldn't just get a bad result at number 1: results 1-20 would be near-duplicates of just a few distinct efforts.
Unfortunately I was a terrible software engineer back then and had much to learn about making a product.
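For the curious, here's a minimal modern sketch of that content-based similarity plus clustering idea, using scikit-learn (not my 2012 code; the toy documents are made up):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Am G F E classic flamenco progression tab",
    "Am G F E another transcription of the same progression",
    "C G Am F four chord pop song lyrics and chords",
]

# Similarity comes from the content itself, not from who links to what.
tfidf = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(tfidf)

# Cluster near-duplicates so each result slot can be drawn from a different cluster.
labels = AgglomerativeClustering(
    n_clusters=2, metric="precomputed", linkage="average"
).fit_predict(1 - sim)

print(sim.round(2))
print(labels)  # e.g. [0 0 1]: the two near-duplicate tabs end up in the same cluster
```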
The author describes calculating cosine similarity of high dimensional vectors. If these are sparse binary vectors why not just store a list of nonzero indexes instead? That way your “similarity” is just the length of the intersection of the two sets of indexes. Maybe I’m missing something.
Yes, that's how it's done.
A dense (bitmap) representation of the matrix wouldn't fit in memory; it would require about a PB of RAM unless my napkin math is off (10^8 × 10^8 bits is on the order of a petabyte). The cardinality of this dataset is in the 100s of millions.
(An additional detail: I'm actually using a tiny fixed-width bloom filter to make it go even faster.)
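For reference, the set-based formulation the parent describes boils down to this (for binary vectors the cosine reduces to |A ∩ B| / √(|A|·|B|)); the bloom filter would presumably act as a cheap pre-check before the exact intersection:

```python
def set_cosine(a: set[int], b: set[int]) -> float:
    """Cosine similarity of two sparse binary vectors stored as sets of nonzero indexes."""
    if not a or not b:
        return 0.0
    return len(a & b) / (len(a) * len(b)) ** 0.5

# Made-up example: IDs of the domains linking to two sites.
links_to_a = {3, 17, 42, 99, 128}
links_to_b = {17, 42, 77, 128}
print(set_cosine(links_to_a, links_to_b))  # 3 / sqrt(5 * 4) ≈ 0.67
```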
I'm curious whether you had a look at GraphBLAS (à la RedisGraph, etc.)?
Nope. I did try a few off-the-shelf approaches, but ultimately figured it was easier to build something myself.
You seem to be missing that that's already how it's implemented.
I love the random page: https://search.marginalia.nu/explore/random
This makes me feel like I'm back on the old open web again.
I am surprised nobody has thought about looking at page content itself to help fight spam. If a blog has nothing except paid affiliate links (Amazon, etc.), ads, and popups after page load (newsletter signups, etc.), then it should probably be down-ranked.
I have actually been developing something like that, but it does more, including down-ranking certain categories of sites that contain unnecessary filler, such as some recipe sites.
Result ranking takes a lot of variables into account, and factors like excessive tracking and affiliate links are among them in my search engine.
You can poke around in the result valuation code here: https://github.com/MarginaliaSearch/MarginaliaSearch/blob/ma...
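As a heavily simplified illustration of that kind of penalty-based valuation (made-up hints and weights, not the actual code linked above, which considers many more signals):

```python
# Purely illustrative penalty scoring: excessive tracking and affiliate links
# push a result down. Hints and weights are invented for this sketch.
TRACKER_HINTS = ("googletagmanager.com", "doubleclick.net", "facebook.net")
AFFILIATE_HINTS = ("amzn.to", "tag=", "/affiliate/")

def spam_penalty(html: str, outbound_links: list[str]) -> float:
    penalty = 0.0
    penalty += 1.5 * sum(html.count(h) for h in TRACKER_HINTS)      # tracking scripts
    penalty += 2.0 * sum(any(h in url for h in AFFILIATE_HINTS)
                         for url in outbound_links)                 # affiliate links
    return penalty  # a higher penalty means a lower ranking

print(spam_penalty('<script src="https://www.googletagmanager.com/gtag.js"></script>',
                   ["https://amzn.to/xyz", "https://example.com/post"]))  # 3.5
```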
Kagi does this (it is one of the main ranking signals).
It gives plausible results for websites similar to HN.
https://explore2.marginalia.nu/search?domain=news.ycombinato...
Aww, sadly nothing for my own websites!
This is such a great idea! Often when I find a small blog or site I want more like it, and this is the perfect tool for discovering that. It's a clear and straightforward idea in retrospect, as all really great ideas tend to be!
It's not even really particularly new; Technorati did this stuff 20 years ago.
What are the details? The sample page just 404s
I migrated the blog over to Hugo a while back and I think I lost the data. But no worries, the Wayback Machine's got a snapshot:
https://web.archive.org/web/20230217165734/https://www.margi...
There are details in the first linked "See Also" post: https://www.marginalia.nu/log/69-creepy-website-similarity/
Discussion for the similarity post: https://news.ycombinator.com/item?id=34143101
Ah, thanks!
cosine similarity approach is better than PageRank
It's still fundamentally PageRank though, it just gets fed website similarities instead of links.
Not my opinion, just a summary for the parent who couldn't load the page.
The whole post could be this line!
To be fair, they are two different metrics. PageRank measures how authoritative a page is; the cosine metric measures how similar a page is to another one.
Concludes that a certain www.example.com is 42% similar to example.com, when they are exactly the same: one redirects to the other.
The only thing different is the domain names, and those character strings themselves are more than 42% similar.
It's not comparing domain names or even content, but the similarity of the sets of domains that link to the given pair of sites. That's sort of the neat part: how well this property correlates with topic similarity.
OK, so certain domains X link to www.whatever.com. Certain domains Y link to whatever.com. The similarity between those is 42% in some metric (like what they link to?)
Yes, it's the cosine similarity between those sets: if, say, 100 domains link to each site and 42 of those domains overlap, you get 42/√(100·100) = 0.42.
I have searched the GitHub repo for information on page ranking.
I am a newbie at SEO. I would greatly appreciate it if Marginalia provided a clean README about its ranking algorithm.
On the Marginalia Search front page we have access to search keywords; the ranking algorithm is important enough to at least be discussed in layman's terms.
How do you optimize a page so that it ranks highly?
I understand this could be in the code documentation, but I have not yet checked it, sorry.
The algorithm does not exist to be manipulated in that way. In fact, the article ends with "This new approach seems remarkably resistant to existing pagerank manipulation techniques". It is my opinion (and I know some people will disagree with me) that SEO is harmful and should not exist. Since you're still new to the industry, it might be worthwhile pivoting to a different occupation.
Hi, I am not that familiar with page ranking and its terminology. I think my original question may have been misunderstood.
I am writing my own web scraper; that is why I am interested in this topic at all.
To distinguish poor pages from better ones, I check the HTML. I think all scrapers need to do that. I rank pages higher if they contain valid titles, og: fields, etc.
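Roughly this kind of check, as a simplified sketch using BeautifulSoup:

```python
from bs4 import BeautifulSoup

def page_quality(html: str) -> int:
    """Crude quality score: reward a real <title>, og: metadata and a description."""
    soup = BeautifulSoup(html, "html.parser")
    score = 0
    if soup.title and soup.title.string and soup.title.string.strip():
        score += 1  # non-empty <title>
    og_tags = soup.find_all("meta", property=lambda p: p and p.startswith("og:"))
    score += min(len(og_tags), 3)  # reward og: fields, but cap the bonus
    if soup.find("meta", attrs={"name": "description"}):
        score += 1
    return score

print(page_quality('<title>Hello</title><meta property="og:title" content="Hello">'))  # 2
```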
There is nothing wrong with checking that, or with asking what I can do to make my site more scrape-friendly.
Thanks,
If you think about it, a lot of people work in "manipulation techniques"
The way the ranking algorithm works is by comparing the similarity of the inbound links between websites.
So to manipulate the algorithm, you'd need to find an important website and then find a way to change all the websites that link to it so that they also link to your own website.
Not sure if serious? From the post:
> This new approach seems remarkably resistant to existing pagerank manipulation techniques
Surprisingly effective for discovering new Mastodon or Pleroma instances!
The "Explore sample data" link 404s.
Goodhart's law: "When a measure becomes a target, it ceases to be a good measure."
This could be helpful in the short-term, but I'm skeptical long-term as it'll become just as gamed.
Isn't that Google's original algorithm?
PageRank is. This is a modification of PageRank: the original algorithm calculates the principal eigenvector of the link graph.
This algorithm uses the same method to calculate an eigenvector, but in an embedding space based on the similarity of the link graph's incidence vectors.
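As a toy illustration, the eigenvector calculation is the same power-iteration trick either way; what changes is the matrix it runs over (a tiny made-up similarity matrix here, nothing like the real scale):

```python
import numpy as np

def eigenvector_rank(S: np.ndarray, damping: float = 0.85, iters: int = 50) -> np.ndarray:
    """Rank nodes via power iteration, the same machinery PageRank applies to the link matrix."""
    n = S.shape[0]
    # Column-normalize so each node distributes its weight across its neighbours.
    M = S / np.maximum(S.sum(axis=0, keepdims=True), 1e-12)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - damping) / n + damping * (M @ r)
    return r

# Three imaginary sites; the first two are strongly similar to each other.
S = np.array([[0.0, 0.8, 0.1],
              [0.8, 0.0, 0.2],
              [0.1, 0.2, 0.0]])
print(eigenvector_rank(S).round(3))
```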
Hmmm, odd. I was under the impression they used cosine similarity based on page content. Once upon a time, based on that 'memory', I created a system to bin domain names into categories using cosine similarity. It worked surprisingly well.
Regardless, well done!
Hmm, seems like something that might be used for deduplication maybe?