• oersted 3 hours ago

    Check out Quickwit; it is briefly mentioned but, I think, mistakenly dismissed. They have been working on a similar concept for a few years and the results are excellent. It’s in no way mainly for logs as they claim: it is a general-purpose, cloud-native search engine like the one they suggest, and very well engineered.

    It is based on Tantivy, a Lucene alternative in Rust. I have extensive hands-on experience with both, and I highly recommend Tantivy: it’s just superior in every way now, such a pleasure to use, and an ideal example of what Rust was designed for.

    • erk__ 2 hours ago

      I have been using Tantivy for Garfield comic search for a few years now, it has been really nice to use in all that time.

      • Semaphor 2 hours ago

        > It’s in no way mainly for logs as they claim

        Where can I find more information on using it for user-facing search? The repository [0] starts with "Cloud-native search engine for observability (logs, traces, and soon metrics!)" and keeps talking about those.

        [0]: https://github.com/quickwit-oss/quickwit

        • oersted 2 hours ago

          That just seems to be the market where search engines have the clearest business case; Elasticsearch positioned itself the same way. But both are general-purpose full-text search engines, perfectly capable of any serious search use case.

          Their original breakout demo was on Common Crawl: https://common-crawl.quickwit.io/

          But thanks for pointing it out, I hadn't looked at it in a few months, it looks like they significantly changed their pitch in the last year. I assume they got VC money and they need to deliver now.

          • AsianOtter 14 minutes ago

            But the demo does not work.

            I tried "England is" and a few similar queries. It spends three seconds then shows that nothing is found.

          • bomewish 2 hours ago

            The big issue with Tantivy I've found is that it only deals with immutable data, so it can't be used for anything you want to do CRUD on. That rules out a LOT of use cases. It's a real shame imo.

            • pentlander 2 hours ago

              I’m pretty sure that Lucene is exactly the same: the segments it creates are immutable, and Elasticsearch is what maintains a “mutable” view of the data. That makes sense, because Tantivy is like Lucene, not like ES.

              https://lucene.apache.org/core/7_7_0/core/org/apache/lucene/...

              • oersted 2 hours ago

                It is indeed mostly designed for bulk indexing and static search, but that is not a strict limitation: frequent small inserts and updates are performant too. Deleting can be a bit awkward; you can only delete every document carrying a given term in a field, but if you use that on a unique id field it behaves just like a normal delete.
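
                The delete-by-term-on-a-unique-id pattern can be sketched with a toy inverted index (a hypothetical Python illustration of the idea only, not the Tantivy API):

```python
# Toy inverted index illustrating "delete by term on a unique id field".
# This sketches the pattern described above; it is NOT the Tantivy API.

class ToyIndex:
    def __init__(self):
        self.docs = {}       # doc_id -> stored document
        self.postings = {}   # (field, token) -> set of doc_ids

    def add(self, doc_id, doc):
        self.docs[doc_id] = doc
        for field, text in doc.items():
            for token in text.split():
                self.postings.setdefault((field, token), set()).add(doc_id)

    def delete_by_term(self, field, token):
        # Deletes EVERY document whose `field` contains `token`; applied
        # to a unique id field this behaves like an ordinary delete.
        for doc_id in self.postings.pop((field, token), set()):
            self.docs.pop(doc_id, None)

    def search(self, field, token):
        # Mask out deleted documents, much like engines skip deleted doc
        # ids inside otherwise immutable segments.
        return sorted(d for d in self.postings.get((field, token), ())
                      if d in self.docs)

idx = ToyIndex()
idx.add("a1", {"id": "a1", "body": "hello world"})
idx.add("a2", {"id": "a2", "body": "hello tantivy"})
idx.delete_by_term("id", "a1")   # acts like a delete-by-primary-key
```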

                Tantivy is a low-level library for building your own search engine (as Quickwit does); like Lucene, it is not a search engine in itself, kind of like how DBs are built on top of key-value stores. But you can definitely build a CRUD abstraction on top of it.

              • victor106 2 hours ago

                Thanks for this info.

              • mdaniel an hour ago

                > Nixiesearch uses an S3-compatible block storage (like AWS S3, Google GCS and Azure Blob Storage)

                Hair-splitting: I don't believe Azure Blob Storage is S3-compatible, so one may want to consider rewording to distinguish between whether it really, no kidding, needs "S3 compatible" or whether that's a euphemism for "key-value blob storage"

                I'm fully cognizant of the 2017 nature of this, but even they are all "use Minio" https://opensource.microsoft.com/blog/2017/11/09/s3cmd-amazo... which I guess made a lot more sense before its license change. There's also a more recent question from 2023 (by an alleged Microsoft Employee!) with a very similar "use this shim" answer: https://learn.microsoft.com/en-us/answers/questions/1183760/...

                • jillesvangurp 21 minutes ago

                  Both Elastic and OpenSearch also have S3-based stateless versions of their search engines in the works; the Elastic one is currently in early access. It would be interesting to see how this one improves on both approaches.

                  With all the licensing complexities around Elastic, more choice is not necessarily bad.

                  The trade-off with using S3 is indexing latency (the time between a write being accepted and becoming visible via search) vs. easy scaling. The default refresh interval (the time the search engine waits before committing changes to an index) is 1 second, which means it can take up to 1 second before indices reflect recently added data. A common performance tweak is to increase this to 5 or more seconds; that reduces the number of writes and can improve write throughput, which is helpful when you are writing lots of data.

                  If you need low latency (anything where users might want to "read" their own writes), clustered approaches are more flexible. If you can afford to wait a few seconds, using S3 to store stuff becomes more feasible.

                  Lucene internally stores documents in segments. Segments are append-only, and there tend to be cleanup activities around rewriting and merging segments, e.g. to get rid of deleted documents or to deal with fragmentation. Once segments are written, having some background jobs merge them isn't that hard. My guess is that with S3, the trick is to gather up a batch of writes, store the batch as one segment, and put that in S3.
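
                  That gather-then-flush idea can be sketched as a buffer that emits an immutable segment object once enough writes accumulate (a toy model with a dict standing in for the bucket; key names and the threshold are made up):

```python
import json

class SegmentWriter:
    """Buffer writes in memory, then flush them as one immutable segment."""

    def __init__(self, store, flush_every=3):
        self.store = store            # dict standing in for an object-store bucket
        self.flush_every = flush_every
        self.buffer = []
        self.next_segment = 0

    def write(self, doc):
        self.buffer.append(doc)
        if len(self.buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # One PUT carrying many documents amortizes the per-request cost of S3.
        key = f"segments/{self.next_segment:08d}.json"
        self.store[key] = json.dumps(self.buffer)
        self.next_segment += 1
        self.buffer = []

bucket = {}
writer = SegmentWriter(bucket, flush_every=3)
for i in range(7):
    writer.write({"id": i})
writer.flush()  # flush the tail on shutdown
```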

                  S3 is not a proper file system, and file operations are relatively expensive (compared to a file system) because they are essentially REST API calls. So this favors use cases where you write segments in bulk and never/rarely update or delete individual things you have written, because that would require updating a segment in S3, which means deleting and rewriting it and then somehow notifying other nodes that they need to re-read that segment.

                  For both Elasticsearch and OpenSearch, log data and other time-series data fit this very well because you typically don't have to deal with deletes or updates.

                  • mikeocool 3 hours ago

                    I love all of the software coming out recently backed by simple object storage.

                    As someone who has spent the last decade and a half getting alerts from RDBMSes, I’m basically at the point where, if you think your system requires more than object storage for state management, I don’t want to be involved.

                    My last company looked at rolling out Elasticsearch/OpenSearch to take certain loads off our DB, but it became clear it was just going to be a second monstrously complicated system that would require a lot of care and feeding, and that we were probably better off spending the time squeezing some additional performance out of our DB.

                    • remram 32 minutes ago

                      On the other hand, the S3-compatible server options are quite limited. While you're not locking yourself to one cloud, you are locking yourself to the cloud.

                      • spaceribs 3 hours ago

                        This is very much the Unix philosophy, right? Everything is a file? [1]

                        [1]: https://en.wikipedia.org/wiki/Everything_is_a_file

                        • pjc50 42 minutes ago

                          Not quite - "everything is a blob" has very different concurrency semantics to "everything is a POSIX file". You can't write into the middle of a blob, for example. This makes certain use cases harder but the concurrency of blobs is much easier to reason about and get right.
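
                          One way to picture the blob semantics: the only write is an atomic whole-object replace, often guarded by an optimistic compare-and-swap on a version tag (a toy in-memory model; real stores expose this as ETags with conditional requests, and the details vary by provider):

```python
import threading

class BlobStore:
    """Whole-object semantics: a write replaces the entire blob atomically."""

    def __init__(self):
        self._blobs = {}             # key -> (version_tag, bytes)
        self._lock = threading.Lock()

    def get(self, key):
        return self._blobs.get(key, (None, None))

    def put_if_match(self, key, data, expected_tag):
        # Conditional put: the blob-store analogue of compare-and-swap.
        # There is no seek()+write() into the middle of an object; the
        # whole value is replaced, or the write is rejected.
        with self._lock:
            tag, _ = self._blobs.get(key, (None, None))
            if tag != expected_tag:
                return False
            self._blobs[key] = ((tag or 0) + 1, data)
            return True

store = BlobStore()
store.put_if_match("manifest", b"v1", None)        # create
tag, _ = store.get("manifest")
ok = store.put_if_match("manifest", b"v2", tag)    # up-to-date tag: accepted
stale = store.put_if_match("manifest", b"v3", 0)   # out-of-date tag: rejected
```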

                          Personally I think you might actually need a DB to do the work of a DB, and you can't as easily build one on top of a blob store as on a block device. But I do think most distributed systems should use blob and/or DB and not the filesystem.

                        • candiddevmike 3 hours ago

                          Why would you prefer state management in object storage vs a relational (or document) database?

                          • mikeocool 2 hours ago

                            So many fewer moving parts to manage/break.

                        • warangal 35 minutes ago

                          I myself have been working on a personal search engine for some time, and one problem I faced was having an effective fuzzy search over all the diverse filenames/directories. All the approaches I could find were based on Levenshtein distance, which would have required storing the original strings/text content in the index, and would neither be practical for comparing larger strings nor be generic enough to handle all knowledge domains. That led me to look at locality-sensitive hashing (LSH) approaches to measure the difference between any two strings in constant time. After some work I finally managed to complete an experimental fuzzy search engine (keyword search is just a special case!).

                          In my analysis of 1 million Hacker News stories, it worked much better than Algolia search while running on a single core! More details are provided in this post: https://eagledot.xyz/malhar.md.html . I tried to submit it here to gather more feedback, but that didn't work out, I guess!
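
                          For readers unfamiliar with LSH: a minimal MinHash sketch over character trigrams shows how strings can be compared via fixed-size signatures instead of the original text (a generic construction; the post linked above may use a different LSH family):

```python
import hashlib

def trigrams(s: str) -> set[str]:
    padded = f"  {s.lower()} "   # padding so word boundaries produce trigrams
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def minhash(tokens: set[str], num_hashes: int = 64) -> list[int]:
    # One fixed-size signature per string: signatures are compared instead
    # of the original text, so comparison time is constant in string length
    # and the index never has to store the original strings.
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{t}".encode(), digest_size=8).digest(),
                "big")
            for t in tokens))
    return signature

def similarity(sig_a: list[int], sig_b: list[int]) -> float:
    # The fraction of matching minima estimates the Jaccard similarity
    # of the underlying trigram sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash(trigrams("hacker news stories"))
b = minhash(trigrams("hacker news story"))
c = minhash(trigrams("completely unrelated"))
```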

                          • iudqnolq 29 minutes ago

                            I'm super new to this, so I'm probably missing something simple, but isn't a trigram index one of the canonical solutions for fuzzy search? E.g. https://www.postgresql.org/docs/current/pgtrgm.html

                            That often involves recording the original trigram positions, but I think that's necessary to rank "I like happy cats" higher than "I like happy dogs but I don't like cats" in a search for "happy cats".
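
                            Even without positions, a minimal pg_trgm-style similarity (Jaccard overlap of trigram sets) already ranks the first sentence higher for that query; padding and normalization here are simplified relative to the real extension:

```python
# Simplified pg_trgm-style similarity: Jaccard overlap of trigram sets.
# The actual extension's padding/normalization rules differ slightly.

def trigram_set(s: str) -> set[str]:
    padded = f"  {s.lower()} "   # padding so word boundaries produce trigrams
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_similarity(a: str, b: str) -> float:
    ta, tb = trigram_set(a), trigram_set(b)
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

query = "happy cats"
doc_close = "I like happy cats"
doc_far = "I like happy dogs but I don't like cats"
```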

                          • mannyv 21 minutes ago

                            I had forgotten that a reindex on Solr/Lucene blows away the index. Now I remember what a nightmare that was, because you couldn't find anything until it was done, which usually took a few hours back when things were HDD-based.

                            Just started a search project, and this one will be on the list for sure.

                            • whalesalad 25 minutes ago

                              I recently got back into search after not touching ES since like 2012-2013. I forgot how much of a fucking nightmare it is to work with and query. Love to see innovation in this space.

                              • staticautomatic 23 minutes ago

                                I feel like it’s not that bad to interact with if you do it regularly, but if I go a while without using it I forget how to do everything. I sure as hell wouldn’t want to admin an instance.

                              • ctxcode 16 minutes ago

                                Sounds like this is going to cost a lot of money (more than it should).

                                • mhitza 3 hours ago

                                  I used offline indexing with Solr back in 2010-2012, because the latency between the Solr server and the MySQL DB (indexing done via the DataImportHandler) caused indexing to take hours instead of under an hour (same server vs. servers in the same datacenter).

                                  In many ways Solr has come a long way since, and I'm curious to see how well they can make a similar system perform in the cloud environment.

                                  • gyre007 3 hours ago

                                    It took us almost two decades, but truly cloud-native architectures are finally becoming a reality. Warp and Turbopuffer are two of the many other examples.

                                    • candiddevmike 3 hours ago

                                      Curious what your definition of cloud native is and why you think this is a new innovation. Storing your state in a bunch of files on a shared disk is a tale as old as time.

                                      • cowsandmilk an hour ago

                                        Not having to worry about the size of the disk, for one. So much time on on-premise systems was spent managing quotas for systems and users alongside the physical capacity.

                                      • mdaniel an hour ago

                                        I didn't recognize Turbopuffer but a quick search coughed up a previous discussion: https://news.ycombinator.com/item?id=40916786

                                        I'm guessing Warp is Warpstream which I have been chomping at the bit to try out: https://hn.algolia.com/?q=warpstream

                                      • marginalia_nu 3 hours ago

                                        This would have been a lot easier to read without all the memes and attempts to inject humor into the writing. It's frustrating because it's an otherwise interesting topic :-/

                                        • prmoustache 3 hours ago

                                          How hard is it to just jump past them?

                                          Answer: it is not.

                                          • infecto 2 hours ago

                                            It generally is a major distraction from the content and feels like a pattern from a decade+ ago when technical blog posts became the hot thing to do.

                                            You can certainly jump over it but I imagine a number of people like myself just skip the article entirely.

                                            • Semaphor 2 hours ago

                                              It is.

                                          • manx 3 hours ago

                                            I thought about creating a search engine using https://github.com/phiresky/sql.js-httpvfs, commoncrawl and cloudflare R2. But never found the time to start...

                                            • oersted 2 hours ago

                                              You will like this then, that was the main demo from the Quickwit team.

                                              https://common-crawl.quickwit.io/

                                              • mallets 3 hours ago

                                                Many things seem feasible with competitive object storage pricing. It still needs a little bit of local caching to reduce read requests and origin abuse.

                                                I think rclone mount can do the same thing with its chunked reads + cache; I wonder what the memory overhead for the process is.

                                              • ko_pivot 3 hours ago

                                                I’m a fan of all these projects that are leveraging S3 to implement high availability / high scalability for traditionally sensitive stateful workloads.

                                                Local caching is a key element of such architectures, otherwise S3 is too slow and expensive to query.

                                                • candiddevmike 3 hours ago

                                                  The write speed is going to be horrendous IME, and how do you handle performant indexing...

                                                • stroupwaffle 2 hours ago

                                                  There’s no such thing as stateless, and there’s no such thing as serverless.

                                                  The universe is a stateful organism in constant flux.

                                                  Put another way: brushing-it-under-the-rug as a service.

                                                  • zdragnar an hour ago

                                                    There is no spoon.

                                                    Put it another way: serverless and stateless don't mean what you think they mean.

                                                    • MeteorMarc an hour ago

                                                      I feel clueless

                                                      • stroupwaffle an hour ago

                                                        It’s not the spoon that bends, it’s the world around it.

                                                        • ctxcode 22 minutes ago

                                                          Serverless just means that a hosting company routes your domain to one or more servers that the hosting company owns and puts your code on, and that it can spin up more or fewer servers based on traffic. TL;DR: serverless uses many, many servers, just none that you own.

                                                    • cynicalsecurity 3 hours ago

                                                      This is a great way to waste investors' money.