Comments Page - Show HN: I made a website to semantically search ArXiv papers

Show HN: I made a website to semantically search ArXiv papers (papermatch.mitanshu.tech)
Submitted by Quizzical4230 a day ago

shishy 19 hours ago
I enjoy seeing projects like this!
If you expand beyond arxiv, keep in mind since coverage matters for lit reviews, unfortunately the big publishers (Elsevier and Springer) are forcing other indices like OpenAlex, etc. to remove abstracts so they're harder to get.
Have you checked out other tools like undermind.ai, scite.ai, and elicit.org?
You might consider what else a dedicated product workflow for lit reviews includes besides search
(used to work at scite.ai)
- Quizzical4230 16 hours ago
  Thank you for the appreciation and great feedback!
  | If you expand beyond arxiv, keep in mind since coverage matters for lit reviews,
  I do have PaperMatchBio [^1] for bioRxiv and PaperMatchMed [^2] for medRxiv; however, I agree that having multiple sites for different domains isn't ideal. I have yet to create a synchronization pipeline for these two, so the results may be a little stale.
  | unfortunately the big publishers (Elsevier and Springer) are forcing other indices like OpenAlex, etc. to remove abstracts so they're harder to get.
  This sounds like a real issue in expanding the coverage.
  | Have you checked out other tools like undermind.ai, scite.ai, and elicit.org?
  I did, but maybe not thoroughly enough. I will check these and add complementing features.
  | You might consider what else a dedicated product workflow for lit reviews includes besides search
  Do you mean a reference management system like Mendeley/Zotero?
  [1]: https://papermatchbio.mitanshu.tech/ [2]: https://papermatchmed.mitanshu.tech/
  eric-burel 15 hours ago
  Unusual use case, but I write literature reviews for the French R&D tax cut system, and we specifically need to: focus on the most recent papers, stay on topic for a very specific problem a company has, potentially include grey literature (tech blog articles from renowned companies), and be as exhaustive as possible when it comes to freely accessible papers (we are more OK with missing paid papers unless they are really popular). A "dedicated product workflow" could be about taking business use cases like that into account. This is a real business problem, the Google Scholar lock-up is annoying, and I would pay for something better than what exists.
  dbmikus 5 hours ago
  Hey, I'm not OP, but I'm working on what seems to be the exact problem you mentioned. We (https://fixpoint.co/) search and monitor web data about companies. We are indexing patents and academic papers right now, plus we can scrape and monitor just about any website (some social media sites not supported).
  We have users with very similar use cases to yours. Want to email me? dylan@fixpoint.co. I'm one of the founders :)
  Quizzical4230 15 hours ago
  This is quite unique. I believe a custom solution might help you better than Google Scholar.
  eric-burel 10 hours ago
  This can be seen as technology watch, as opposed to a thesis literature review, for instance. Google Scholar gives the best results but sadly doesn't really want you to build products on top of it: no API, no scraping. Breaking this monopoly would be a huge step forward, especially when coupled with semantic search.
  mattigames 6 hours ago
  "|" is a terrible character for signaling quotes, as it looks a bit too much like "I" or "l" and sometimes even "1" or "i" depending on the font used. I believe the greater-than symbol (>) is better suited for this task.
  Quizzical4230 an hour ago
  So true ;-; I was following the Gmail protocol. I will use > from now on. Happy Holidays :D
swyx 11 hours ago
1. why mixbread's model?
2. how much efficiency gain did you see binarising embeddings/using hamming distance?
3. why milvus over other vector stores?
4. did you automate the weekly metadata pull? just a simple cron job? anything else you need orchestrated?
user thoughts on searching for "transformers on byte level not token level" - was good but didn't turn up https://arxiv.org/abs/2412.09871 <- which is more recent, more people might want
also you might want more result density - so perhaps a UI option to collapse the abstracts and display more in the first glance.
- Quizzical4230 2 hours ago
  1. The model size was small enough to process the corpus fast-ish using the limited resources I have. They also support MRL and binary embeddings, which would be helpful in case I need to downsize the VM.
  2. Close to 500ms. See [^1].
  3. This [^2] was the reason I went with milvus. I also assumed that more stars would result in a bigger community and hence faster bug discovery and fixes. And better feature support.
  4. Yes, I automated the weekly pull here [^3]. Since I am constrained on available resources, I used HuggingFace Spaces to do the automation for me :) However, the space keeps going to sleep, and to avoid that I am planning to keep calling the same space using api/gradio_client. Let's see how that goes.
  | which is more recent, more people might want
  Absolutely agree. I am planning to add a 'Recency' sorting option for the same. It should balance between similarity and the date published.
  | also you might want more result density - so perhaps a UI option to collapse the abstracts and display more in the first glance.
  Oh, I will surely look into it. Thank you so much for a detailed response. :D
  [1]: https://news.ycombinator.com/item?id=42507116#42509636 [2]: https://benchmark.vectorview.ai/vectordbs.html [3]: https://huggingface.co/spaces/bluuebunny/update_arxiv_embedd...
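One way the 'Recency' sorting idea mentioned above could balance similarity against publication date is a weighted blend with an exponential decay on paper age. This is only a sketch: the function name `blended_score`, the one-year half-life, and the 0.3 weight are assumptions made for illustration, not anything taken from PaperMatch.

```python
from datetime import date


def blended_score(similarity: float, published: date, today: date,
                  half_life_days: float = 365.0, weight: float = 0.3) -> float:
    """Mix semantic similarity with an exponentially decaying recency bonus."""
    age_days = max((today - published).days, 0)
    # 1.0 for a paper published today, 0.5 after one half-life, 0.25 after two.
    recency = 0.5 ** (age_days / half_life_days)
    return (1.0 - weight) * similarity + weight * recency


# Hypothetical search hits: the older paper is slightly more similar.
results = [
    {"id": "older-paper", "similarity": 0.82, "published": date(2021, 5, 1)},
    {"id": "newer-paper", "similarity": 0.78, "published": date(2024, 12, 1)},
]
today = date(2024, 12, 25)
ranked = sorted(
    results,
    key=lambda r: blended_score(r["similarity"], r["published"], today),
    reverse=True,
)
print([r["id"] for r in ranked])
```

With `weight=0.0` this degenerates to plain similarity ranking, so the weight could be exposed as a slider rather than a hard switch.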
  swyx 19 minutes ago
  my pleasure, thank you for the reply! I've never used milvus or heard of mixbread so this was refreshing.
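For readers unfamiliar with the binarised embeddings and Hamming distance from question 2, here is a minimal NumPy sketch of the general technique. It assumes sign-based binarisation (bit = 1 where the float is positive); the helper names `binarize` and `hamming_search` are made up for this example, and a vector store such as Milvus would do the equivalent work internally.

```python
import numpy as np


def binarize(embeddings: np.ndarray) -> np.ndarray:
    """Keep only the sign of each dimension and pack 8 dimensions into one byte."""
    return np.packbits(embeddings > 0, axis=-1)


# Lookup table: number of set bits for every possible byte value.
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)


def hamming_search(query_bits: np.ndarray, doc_bits: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k documents with the smallest Hamming distance."""
    xor = np.bitwise_xor(doc_bits, query_bits)  # bits that differ, byte by byte
    distances = POPCOUNT[xor].sum(axis=1, dtype=np.int64)
    return np.argsort(distances, kind="stable")[:k]


rng = np.random.default_rng(0)
docs = rng.standard_normal((1000, 64)).astype(np.float32)  # stand-in embeddings
doc_bits = binarize(docs)                                   # shape (1000, 8)
print(docs.nbytes, "bytes as float32 ->", doc_bits.nbytes, "bytes as bits")
print(hamming_search(binarize(docs[42]), doc_bits, k=5))
```

Going from float32 to one bit per dimension shrinks the index 32x, and the distance computation becomes XOR plus a bit count, which is where the speed-up comes from.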
fasa99 8 hours ago
For what it's worth, back in the day (a few years ago, before the LLM boom) I found that on a similar-sized vector database (gensim / doc2vec) it's possible to just brute-force a vector search, e.g. with SSE or AVX instructions. You can code it in C and expose a Python API. Your data appears to be a few gigs, so that's feasible for real-time CPU brute force, <200 ms.
- Quizzical4230 an hour ago
  This is an interesting problem to tackle. Added to TODO list! :D
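A rough sketch of what exact brute-force search looks like, using NumPy instead of the hand-written C with SSE/AVX suggested above (NumPy hands the matrix-vector product to a BLAS library, which typically uses SIMD itself). The function names and the random stand-in data are assumptions for illustration; no timing is claimed here.

```python
import numpy as np


def normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=-1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)


def brute_force_search(query: np.ndarray, docs: np.ndarray, k: int) -> np.ndarray:
    """Exact top-k by cosine similarity; `docs` must already be normalized."""
    scores = docs @ normalize(query)            # one dot product per document
    top = np.argpartition(-scores, k - 1)[:k]   # unordered top-k without a full sort
    return top[np.argsort(-scores[top])]        # sort only those k


rng = np.random.default_rng(0)
docs = normalize(rng.standard_normal((10_000, 128)).astype(np.float32))
print(brute_force_search(docs[7], docs, k=5))
```

Because every document is scored, there is no index to build or tune and recall is exact; the cost is one full pass over the data per query.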
omarhaneef 15 hours ago
For every application of semantic search, I’d love to see what the benefit is over text search. Is there a benchmark to see if it improves the search? Subjectively, did you find it surfaced new papers? Is this more useful in certain domains?
- Quizzical4230 15 hours ago
  All benefits depend on the ability of the embedding model. Semantic embeddings capture nuance, so they can match abstracts that align conceptually even if no exact keywords overlap. For example, queries for "neural networks" and "deep learning" can and should fetch similar papers.
  Subjectively, yes. I sent this around my peers and they said it helped them find new authors/papers in the field while preparing their manuscripts.
  | Is this more useful in certain domains?
  I don't think I have the capacity to comment on this.
- feznyng 11 hours ago
  One of the factors is how users phrase their queries. On some level people are used to full text search but semantic shines when they ask literal questions with terminology that may not match the answer.
  Quizzical4230 an hour ago
  Exactly. The full-text paradigm has its own pros, and I believe we need those tools in the new vector search to take full advantage. I am planning to add a keywords feature where, if a user enters something in "quotes", those terms would need to be in the shown results. Just like you can do with a Google search.
  woodson 9 hours ago
  Query keyword expansion works quite well for that without semantic search (although it can reduce precision).
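The "quotes" feature described above could be sketched as a post-filter on top of the vector search results. This is a hypothetical illustration: `split_query` and `filter_results` are invented names, and it assumes each result is a dict with an `abstract` field.

```python
import re


def split_query(query: str) -> tuple[str, list[str]]:
    """Return (text to embed, list of "quoted" phrases that must appear)."""
    phrases = re.findall(r'"([^"]+)"', query)
    # The quotes are dropped but the words stay, so they still shape the embedding.
    text = " ".join(query.replace('"', " ").split())
    return text, phrases


def filter_results(results: list[dict], phrases: list[str]) -> list[dict]:
    """Keep results whose abstract contains every required phrase (case-insensitive)."""
    wanted = [p.lower() for p in phrases]
    return [r for r in results if all(p in r["abstract"].lower() for p in wanted)]


text, phrases = split_query('byte level "transformer" models')
hits = [
    {"abstract": "A Transformer that operates on raw bytes."},
    {"abstract": "A token-level recurrent network."},
]
print(text, phrases)
print(filter_results(hits, phrases))
```

Filtering after retrieval is the simplest option but can leave fewer results than requested; pushing the keyword condition into the vector store's own filter, where supported, avoids that.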
namanyayg 16 hours ago
What are other good areas where semantic search can be useful? I've been toying with the idea for a while to play around and make such a webapp.
Some of the current ideas I had:
1. Online ads search for marketers: embed and index video + image ads, allow natural language search to find marketing inspiration.
2. Multi e-commerce platform search for shopping: find products across Sephora, Zara, H&M, etc.
I don't know if either are good enough business problems worth solving tho.
- bubaumba 16 hours ago
  3. Quick lookup into internal documents. Almost any company needs it. Navigating a file-system-like hierarchy is slow and limited. That was the old way.
  4. Quick lookup into the code to find relevant parts even when the wording in comments is different.
  imadethis 13 hours ago
  For 4, it would be neat to first pass each block of code (function or class or whatever) through an llm to extract meaning, and then embed some combination of llm parsed meaning, docstring and comments, and function name. Then do semantic search against that.
  That way you’d cover what the human thinks the block is for vs what an LLM “thinks” it’s for. Should cover some amount of drift in names and comments that any codebase sees.
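A small sketch of the "combine name, docstring, comments and an LLM summary" idea for Python code, using only the standard library. The `summarize` callable is a hypothetical stand-in for whatever LLM call you would use (it is not a real API), and the resulting strings would then be passed to an embedding model of your choice.

```python
import ast
import io
import tokenize
from typing import Callable, Optional


def describe_functions(source: str,
                       summarize: Optional[Callable[[str], str]] = None) -> dict[str, str]:
    """Build one text-to-embed per function: name, docstring, comments, optional summary."""
    # Collect comments by line number (ast discards them, tokenize keeps them).
    comments = {
        tok.start[0]: tok.string.lstrip("# ").strip()
        for tok in tokenize.generate_tokens(io.StringIO(source).readline)
        if tok.type == tokenize.COMMENT
    }
    texts = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            parts = [node.name.replace("_", " "), ast.get_docstring(node) or ""]
            parts += [text for line, text in sorted(comments.items())
                      if node.lineno <= line <= node.end_lineno]
            if summarize is not None:  # hypothetical LLM hook
                parts.append(summarize(ast.get_source_segment(source, node)))
            texts[node.name] = " | ".join(p for p in parts if p)
    return texts


sample = (
    "def load_papers(path):\n"
    '    """Read arXiv metadata."""\n'
    "    # skip withdrawn entries\n"
    "    return []\n"
)
print(describe_functions(sample))
```

Keeping the human-written parts and the generated summary in one string is the simplest choice; embedding them separately would let you weight the two views differently at query time.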
- jondwillis 10 hours ago
  Please stop making ad tech better. Someone else might, but you don’t have to.
shigeru94 21 hours ago
Is this similar to https://www.semanticscholar.org (from Allen Institute for AI) ?
- triilman 19 hours ago
  I think more like this website https://arxivxplorer.com/
- Quizzical4230 16 hours ago
  It is more like what triilman commented, but with all components open-source. I plan to add filters soon enough with keywords support! (actually waiting for milvus)
Maro 12 hours ago
Very cool!
Add a "similar papers" link to each paper, that will make this the obvious way to discover topics by clicking along the similar papers.
- Quizzical4230 an hour ago
  Amazing! I will do so :D
mskar 14 hours ago
This is awesome! If you’re interested, you could add a search tool client for your backend in paper-qa (https://github.com/Future-House/paper-qa). Then paper-qa users would be able to use your semantic search as part of its workflow.
- Quizzical4230 14 hours ago
  paper-qa looks pretty cool. I will do so!
mrjay42 17 hours ago
I think you have an encoding problem <3
If you search for "UPC high performance computing evaluation", you'll see a paper with buggy characters in the authors' names (second result for that search).
- Quizzical4230 16 hours ago
  Most definitely. Thank you for pointing this out!
amelius 5 hours ago
Great procrastination project :)
- Quizzical4230 an hour ago
  hey hey hey! XD
bubaumba 16 hours ago
This is cool, but how about local semantic search through tens of thousands of articles and books? Surely I'm not the first; there should be some tools already.
- ttpphd 6 hours ago
  Try Semantra https://github.com/freedmand/semantra
- Quizzical4230 15 hours ago
  I definitely was thinking about something like this for PaperMatch itself, where anyone can pull a Docker image and search through the articles locally! Do you think this idea is worth pursuing?
  bubaumba 14 hours ago
  Absolutely worth doing. Here is an interesting related video on local RAG:
  https://www.youtube.com/watch?v=bq1Plo2RhYI
  I'm not an expert, but I'll do it for learning, then open source it if it works. As far as I understand, this approach requires a vector database and an LLM, which doesn't have to be big. Technically it can be implemented as a local web server. It should be easy to use: just type and get a list sorted by relevance.
  Quizzical4230 14 hours ago
  Perfect!
  Although, atm I am only using retrieval without any LLM involved. Might try integrating if it significantly improves UX without compromising speeds.
tokai 15 hours ago
Nice but I have to point out that a systematic review cannot be done with semantic search and should never be done in a preprint collection.
- dmezzetti 15 hours ago
  Why?
  Quizzical4230 15 hours ago
  Not sure about the semantic search, but preprints are not peer reviewed and hence not vetted. However, at the current pace of papers on arXiv (5k+/week), peer review alone might halt the progress.
  dmezzetti 15 hours ago
  Why not semantic search was the bigger question.
- Quizzical4230 15 hours ago
  Agreed.
maCDzP 9 hours ago
I want to crawl Sci-Hub, plug it into this, and see what happens.
lgas 20 hours ago
This might've saved you some time: https://huggingface.co/NeuML/txtai-arxiv
- cluckindan 17 hours ago
  The dataset there is almost a year old.
  dmezzetti 16 hours ago
  It was just updated last week. The dataset page on HF only has the scripts, the raw data resides over on Kaggle.
- Quizzical4230 15 hours ago
  Actually, yeah XD
andai 14 hours ago
Did you notice a difference in performance after binarization? Do you have a way to measure performance?
- Quizzical4230 14 hours ago
  Absolutely!
  Here is a graph showing the difference. [^1]
  A Known ID is an arXiv ID that is already in the vector database; Unknown IDs need their metadata fetched via the API. Text is embedded via the model's API.
  FLAT and IVF_FLAT are different indexes used for the search. [^2]
  [1]: https://raw.githubusercontent.com/mitanshu7/dumpyard/refs/he...
  [2]: https://zilliz.com/learn/how-to-pick-a-vector-index-in-milvu...
  binarymax 14 hours ago
  That looks great for speed, but what about recall?
  Quizzical4230 13 hours ago
  That has a major downgrade. For binary embeddings, the top 10 results are the same as fp32, albeit shuffled. However, after the 10th result, I think quality degrades quite a bit. I was planning to add a reranking strategy for binary embeddings. What do you think?
  intalentive 12 hours ago
  Recommend reranking. You basically get full resolution performance for a negligible latency hit. (Unless you need to make two network calls…)
  MixedBread supports matryoshka embeddings too so that’s another option to explore on the latency-recall curve.
  Quizzical4230 an hour ago
  > Recommend reranking.
  Will explore it thoroughly then!
  > MixedBread supports matryoshka embeddings too so that’s another option to explore on the latency-recall curve.
  Yes, exactly why I went with this model!
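The reranking and matryoshka ideas discussed above can be sketched in a few lines of NumPy. This is an illustrative two-stage setup, not PaperMatch's code: a cheap coarse pass picks candidates by sign agreement, then only those candidates are re-scored with the full-precision vectors. The function names and candidate count are assumptions, and truncation is only meaningful for models trained with Matryoshka Representation Learning.

```python
import numpy as np


def rerank(query: np.ndarray, candidate_ids: np.ndarray,
           doc_vectors: np.ndarray, k: int) -> np.ndarray:
    """Re-score a short candidate list with full-precision dot products."""
    scores = doc_vectors[candidate_ids] @ query
    order = np.argsort(-scores, kind="stable")[:k]
    return candidate_ids[order]


def truncate_matryoshka(vectors: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` dimensions and re-normalize to unit length."""
    head = vectors[..., :dims]
    return head / np.maximum(np.linalg.norm(head, axis=-1, keepdims=True), 1e-12)


rng = np.random.default_rng(0)
docs = rng.standard_normal((5000, 128)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[3]

# Stage 1: coarse candidates from sign bits only (Hamming distance on binarized vectors).
hamming = ((docs > 0) != (query > 0)).sum(axis=1)
candidates = np.argsort(hamming, kind="stable")[:50]

# Stage 2: full-precision rerank of just those 50 candidates.
print(rerank(query, candidates, docs, k=10))
```

The second stage touches only the candidate rows, which is why the added latency stays small as long as the float vectors are reachable without an extra network round trip.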
antman 17 hours ago
Nice work. Any other technical comments: why did you use those embeddings, did you binarize them, did you use any special prompts?
- Quizzical4230 15 hours ago
  At the beginning of the project, MixedBread's embedding model was small and leading the MTEB leaderboard [^1], hence I went with it.
  Yes, I did binarize them for a faster search experience. However, I think the search quality degrades significantly after the first 10 results, which are the same as the fp32 search but in a shuffled order. I am planning to add a reranking strategy to push better results upwards.
  At the moment, this is plain search with no special prompts.
  [1]: https://huggingface.co/spaces/mteb/leaderboard
venice_benice 5 hours ago
Interesting project. I’m not really sure how useful it is for field-specific stuff: I'm searching for “image reduction astronomy”, and it shows all sorts of related work that is not image reduction (including noise reduction, which is not the same thing). I’m not familiar enough with vector search to evaluate it well.
However I can give you the heads-up that the abstracts don't render well because (La)TeX is interpreted as markdown so that
```
    Paper~1 shows something and Paper~2 shows something else
```
will strikethrough the text between the tildes (whereas they are meant to be non-breaking spaces). Similarly for the backtick which makes text monospaced in the rendered output but is simply supposed to be the opening quote.
- Quizzical4230 an hour ago
  Yes, I think vector search is tricky to navigate at times since now the onus is on the user to explain the problem well. However, you can copy paste full abstracts to get similar papers well enough.
  I will fix the LaTeX rendering ASAP.
  Thank you for trying out the site! Happy Holidays :D
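One possible way to handle the tilde and backtick issue described above is to convert the LaTeX text conventions to Unicode before the abstract reaches the Markdown renderer. This is a sketch under assumptions: the function name is invented, it only covers the cases raised in this thread, and replacing `''` could also touch double primes in inline math, so real abstracts may need more care.

```python
import re


def latex_text_to_plain(abstract: str) -> str:
    """Convert common LaTeX text conventions to Unicode before Markdown rendering."""
    # Unescaped tilde is a non-breaking space in LaTeX; leave the \~ accent command alone.
    text = re.sub(r"(?<!\\)~", "\u00a0", abstract)
    # LaTeX double quotes: ``like this'' becomes curly double quotes.
    text = text.replace("``", "\u201c").replace("''", "\u201d")
    # A remaining single backtick is an opening single quote, not a code span.
    text = text.replace("`", "\u2018")
    return text


print(latex_text_to_plain("Paper~1 shows ``something'' and Paper~2 shows something else"))
```

An alternative is to escape these characters for the Markdown renderer instead of converting them, which keeps the original source text intact.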
madbutcode 14 hours ago
This looks great! I have used the bioRxiv version of PaperMatch and it gives pretty good results!
- Quizzical4230 3 hours ago
  Thank you for your kind words!
gaborme 15 hours ago
Nice. Why not use a full-text search like self-hosted Typesense?
- Quizzical4230 15 hours ago
  Full text search would be redundant as arXiv.org already supports it. For semantic search, Typesense has a limited collection of embedding models. [^1]
  [1]: https://huggingface.co/typesense/models/tree/main
dmezzetti 16 hours ago
Excellent project.
As mentioned in another comment, I've put together an embeddings database using the arxiv dataset (https://huggingface.co/NeuML/txtai-arxiv) recently.
For those interested in the literature search space, here are a couple of other projects I've worked on that may be of interest.
annotateai (https://github.com/neuml/annotateai) - Annotates papers with LLMs. Supports searching the arxiv database mentioned above.
paperai (https://github.com/neuml/paperai) - Semantic search and workflows for medical/scientific papers. Built on txtai (https://github.com/neuml/txtai)
paperetl (https://github.com/neuml/paperetl) - ETL processes for medical and scientific papers. Supports full PDF docs.
- Quizzical4230 15 hours ago
  Thank you for your kind words.
  These look like great projects, I will surely check them out :D
- shishy 16 hours ago
  paperetl is cool, saving that for later, nice! did something similar in-house with grobid in the past (great project by patrice).
  dmezzetti 16 hours ago
  Grobid is great. paperetl is the workhorse of the projects mentioned above. Good ole programming and multiprocessing to churn through data.
ukuina 16 hours ago
Related: emergentmind.com
- Quizzical4230 15 hours ago
  Thank you for the link. Would you know any reliable small model to add on top of vanilla search for a similar experience?