• jart 2 years ago

    Looking at https://explore2.marginalia.nu/search?domain=simonwillison.n... now that's an interesting service. The web has felt isolating since it became commercialized. Bloggers are living in the Google dark ages right now. Having information like this be readily accessible could help us find each other and get the band back together. The open web can be reborn.

    • marginalia_nu 2 years ago

      Yeah, this is basically what I've been trying to show people for the last few years. There's still so much wild Internet out there if you go looking for it.

      • masfuerte 2 years ago

        The current incarnation of diveintomark.org really doesn't belong in that list. The original went offline more than a decade ago.

        • marginalia_nu 2 years ago

          Yes, explore2 is just a demo; an unfiltered listing of the output of this algorithm. For better or worse, it has no concept of dead links. If and when I productize it, it will need to hook into the search engine's link database better.

      • marginalia_nu 2 years ago

        BTW, if anyone wants to dabble in this problem space, I make, among other things, the entire link graph available here: https://downloads.marginalia.nu/exports/

        (hold my beer as I DDOS my own website by offering multi-gigabyte downloads on the HN front page ;-)

        • estebarb 2 years ago

          Why not offer it only via BitTorrent?

          • marginalia_nu 2 years ago

            Well I mean I could, but it's easier and more convenient to just put them in a directory on the server than go through all the rigmarole of creating a torrent.

            • gary_0 2 years ago

              Hey, that's an interesting idea for a side project: an easy way to serve a file via torrent. Maybe a commandline utility that uses WebSeeding so you don't need to run a torrent client? Hmm.

              (I did a quick `apt search` to see if something like that is already available, but didn't find anything.)

          • tmcdos 2 years ago

            Troy Hunt had the same story a while ago, might be helpful - https://www.troyhunt.com/how-i-got-pwned-by-my-cloud-costs/

            • marginalia_nu 2 years ago

              I run outside of the cloud and have fixed costs so in that sense it's no prob :-)

          • robbomacrae 2 years ago

            In 2012 I was trying to turn my PhD thesis into a product for a better guitar tab and song lyrics search engine. The method was precisely this: use cosine similarity on the content itself (musical instructions parsed from the tabs or the tokens of the lyrics).

            This way I was not only able to get much better search results than with PageRank, but there was another benefit of this approach: you could cluster the results and choose a distinct cluster for each subsequent result. With Google you would not just get bad results at number 1; results 1-20 would be near duplicates of just a few distinct efforts.

            Unfortunately I was a terrible software engineer back then and had much to learn about making a product.

            • janalsncm 2 years ago

              The author describes calculating cosine similarity of high dimensional vectors. If these are sparse binary vectors why not just store a list of nonzero indexes instead? That way your “similarity” is just the length of the intersection of the two sets of indexes. Maybe I’m missing something.

              • marginalia_nu 2 years ago

                Yes, that's how it's done.

                A dense (bitmap) representation of the matrix wouldn't fit in memory; it would require about a PB of RAM unless my napkin math is off. The cardinality of this dataset is in the hundreds of millions.

                (An additional detail is that I'm actually using a tiny fixed-width bloom filter to make it go even faster.)
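
                Roughly, the set-intersection version looks something like this (an illustrative sketch, not the production code; the fingerprint functions are just a toy stand-in for the bloom filter mentioned above):

                  import math

                  def cosine_similarity(a: set[int], b: set[int]) -> float:
                      """Cosine similarity of two sparse binary vectors stored as sets of nonzero indexes."""
                      if not a or not b:
                          return 0.0
                      # For binary vectors the dot product is the intersection size,
                      # and each norm is the square root of the set's cardinality.
                      return len(a & b) / math.sqrt(len(a) * len(b))

                  def fingerprint(a: set[int], bits: int = 64) -> int:
                      """Tiny fixed-width bloom-style fingerprint of an index set."""
                      fp = 0
                      for i in a:
                          fp |= 1 << (hash(i) % bits)
                      return fp

                  def may_overlap(fp_a: int, fp_b: int) -> bool:
                      # Cheap prefilter: if the fingerprints share no set bits, the index sets
                      # cannot intersect, so the full similarity computation can be skipped.
                      return (fp_a & fp_b) != 0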

                • socksy 2 years ago

                  I'm curious whether you had a look at GraphBLAS (à la RedisGraph etc.)?

                  • marginalia_nu 2 years ago

                    Nope, did try a few off the shelf approaches, but ultimately figured it was easier to build something myself.

                • azornathogron 2 years ago

                  You seem to be missing that that's already how it's implemented.

                  https://www.marginalia.nu/log/69-creepy-website-similarity/

                • jakearmitage 2 years ago

                  I love the random page: https://search.marginalia.nu/explore/random

                  This makes me feel in the old open web again.

                  • eek2121 2 years ago

                    I am surprised nobody has thought about looking into page content itself to help fight spam. If a blog has nothing except paid affiliate links (Amazon, etc.), ads, and popups after page load (newsletter signups, etc.), then it should probably be down-ranked.

                    I have actually been developing something like that, but it does more, including down-ranking certain categories of sites that contain unnecessary filler, such as some recipe sites.

                    • marginalia_nu 2 years ago

                      Result ranking takes a lot of variables into account, and factors like excessive tracking and affiliate links are among them in my search engine.

                      You can poke around in the result valuation code here: https://github.com/MarginaliaSearch/MarginaliaSearch/blob/ma...

                      • freediver 2 years ago

                        Kagi does this (it is one of the main ranking signals).

                      • nemoniac 2 years ago

                        It gives plausible results for websites similar to HN.

                        https://explore2.marginalia.nu/search?domain=news.ycombinato...

                        • buildbot 2 years ago

                          Aww, sadly nothing for my own websites!

                          This is such a great idea, often when I find a small blog or site I want more of it! This is the perfect tool to discover that. It’s a clear and straightforward idea in retrospect, as all really great ideas tend to be!

                          • marginalia_nu 2 years ago

                            It's not even really particularly new; Technorati did this stuff 20 years ago.

                          • solardev 2 years ago

                            What are the details? The sample page just 404s.

                            • marginalia_nu 2 years ago

                              I migrated the blog over to Hugo a while back and I think I lost the data. But no worries, the Wayback Machine's got a snapshot:

                              https://web.archive.org/web/20230217165734/https://www.margi...

                              • bayesianbot 2 years ago

                                There are details in the first linked "See Also" post: https://www.marginalia.nu/log/69-creepy-website-similarity/

                              • ipaddr 2 years ago

                                The cosine similarity approach is better than PageRank.

                                • marginalia_nu 2 years ago

                                  It's still fundamentally PageRank though, it just gets fed website similarities instead of links.

                                  • ipaddr 2 years ago

                                    Not my opinion, only a summary for the parent who couldn't load the page.

                                  • vasco 2 years ago

                                    The whole post could be this line!

                                    • altdataseller 2 years ago

                                      To be fair, they are two different metrics. PageRank measures how authoritative a page is. The cosine metric measures how similar a page is to another one.

                                • undefined 2 years ago
                                  [deleted]
                                  • kazinator 2 years ago

                                    Concludes that a certain www.example.com is 42% similar to example.com, when they are exactly the same: one redirects to the other.

                                    The only thing different is the domain names, and those character strings themselves are more than 42% similar.

                                    • marginalia_nu 2 years ago

                                      It's not comparing domain names or even content, but the similarity of the sets of domains that link to the given pair. That's sort of the neat part: how well this property correlates with topic similarity.

                                      • kazinator 2 years ago

                                        OK, so certain domains X link to www.whatever.com. Certain domains Y link to whatever.com. The similarity between those is 42% in some metric (like what they link to?)

                                        • marginalia_nu 2 years ago

                                          Yes, it's a cosine similarity between those sets.
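
                                          To make the number concrete, here's a toy example with invented domains (not real crawl data):

                                            # domains linking to www.example.com vs. example.com (invented)
                                            linkers_www  = {"a.example", "b.example", "c.example", "d.example"}
                                            linkers_bare = {"a.example", "b.example", "e.example"}
                                            shared = linkers_www & linkers_bare   # 2 domains in common
                                            # cosine similarity of two sets: |A ∩ B| / sqrt(|A| * |B|)
                                            similarity = len(shared) / (len(linkers_www) * len(linkers_bare)) ** 0.5   # 2 / sqrt(12) ≈ 0.58

                                          So two names for the same site can score well below 1.0 simply because different subsets of the web happen to link to each name.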

                                    • renegat0x0 2 years ago

                                      I have searched the GitHub repo for information on page ranking.

                                      I am a newbie in SEO. I would greatly appreciate it if Marginalia provided a clean README about its ranking algorithm.

                                      On the Marginalia Search front page we have access to search keywords; the page ranking algorithm is important enough to at least be discussed in layman's terms.

                                      How do you optimize a page so it can rank highly?

                                      I understand this could be in the code documentation, but I have not yet checked it, sorry.

                                      • hliyan 2 years ago

                                        The algorithm does not exist to be manipulated in that way. In fact, the article ends with "This new approach seems remarkably resistant to existing pagerank manipulation techniques". It is my opinion (and I know some people will disagree with me) that SEO is harmful and should not exist. Since you're still new to the industry, it might be worthwhile pivoting to a different occupation.

                                        • renegat0x0 2 years ago

                                          Hi, I am not that familiar with page ranking and its terminology. I think my original questions may have been misunderstood.

                                          I am writing my own web scraper. That is why I am in fact interested in this topic at all.

                                          To distinguish poor pages from better ones, I check the HTML. I think all scrapers need to do that. I rank pages higher if they contain valid titles, og: fields, etc.

                                          There is nothing wrong with checking that and asking what I can do to make my site more scraper-friendly.

                                          Thanks,

                                          • is_true 2 years ago

                                            If you think about it, a lot of people work in "manipulation techniques"

                                          • marginalia_nu 2 years ago

                                            The way the ranking algorithm works is by comparing the similarity of the inbound links between websites.

                                            So to manipulate the algorithm, you'd need to find an important website, and then find a way of making changes to all the websites that link to that website to add a link to your own website.

                                            • dleeftink 2 years ago

                                              Not sure if serious? From the post:

                                              > This new approach seems remarkably resistant to existing pagerank manipulation techniques

                                            • derelicta 2 years ago

                                              Surprisingly effective for discovering new Mastodon or Pleroma instances!

                                              • kgbcia 2 years ago

                                                The "Explore sample data" link 404s.

                                              • lowkey_ 2 years ago

                                                Goodhart's law: "When a measure becomes a target, it ceases to be a good measure."

                                                This could be helpful in the short term, but I'm skeptical long-term, as it'll become just as gamed.

                                                • undefined 2 years ago
                                                  [deleted]
                                                • zanethomas 2 years ago

                                                  Isn't that Google's original algorithm?

                                                  • marginalia_nu 2 years ago

                                                    PageRank is. This is a modification of PageRank. The original algorithm calculates the eigenvector of the link graph.

                                                    This algorithm uses the same method to calculate an eigenvector in an embedding space based on the similarity of the incident vectors of the link graph.
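
                                                    In code terms it's the same power iteration, just fed a domain-similarity matrix instead of the raw link matrix. A toy sketch (a dense matrix only works at toy scale; as noted upthread, the real dataset is far too large for that):

                                                      import numpy as np

                                                      def rank(S: np.ndarray, damping: float = 0.85, iters: int = 50) -> np.ndarray:
                                                          # S[i][j] = similarity between domain i and domain j (instead of a raw link matrix).
                                                          n = S.shape[0]
                                                          # Normalise rows so each behaves like a probability distribution.
                                                          M = S / np.maximum(S.sum(axis=1, keepdims=True), 1e-12)
                                                          r = np.full(n, 1.0 / n)
                                                          for _ in range(iters):
                                                              # The classic PageRank update; converges towards the principal eigenvector.
                                                              r = (1 - damping) / n + damping * (M.T @ r)
                                                          return r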

                                                    • zanethomas 2 years ago

                                                      Hmmm, odd. I was under the impression they used cosine similarity based on page content. Once upon a time, based on that 'memory', I created a system to bin domain names into categories using cosine similarity. It worked surprisingly well.

                                                      Regardless, well done!

                                                      • marginalia_nu 2 years ago

                                                        Hmm, seems like something that might be used for deduplication maybe?