Comments Page - Show HN: I scraped 3B Goodreads reviews to train a better recommendation model

book.sv | Submitted by costco a day ago

vessenes 3 hours ago
OK, I just added books until you told me I had too many. Fun idea! I have a couple of suggestions:
* UI - once someone clicks "Add" you really should remove that item from the suggested list - it's very confusing to still see it.
* Beam search / diversification -- Your system threw like 100 books at me of which I'd read 95 and heard of 2 of the other 3, so it worked for me as a predictor of what I'd read, but not so well for discovery.
I'd be interested in recommendations that pushed me into a new area, or gave me a surprising read. This is easier to do if you have a fairly complete list of what someone's read, I know. But off the top of my head, I'm imagining finding my eigenfriends, then finding books that are either controversial (very wide rating differences amongst my fellow readers) or possibly ghettoized, that is, some portion of similar readers also read this X or Y subject, but not all.
Anyway, thanks, this is fun! Hook up a VLM and let people take pictures of their bookshelf next.
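A minimal sketch of the "controversial among similar readers" idea above, assuming you already have the star ratings that your nearest-neighbour readers gave each book. The function name, the `min_ratings` threshold, and the sample data are all hypothetical; rating variance is just one simple way to measure disagreement.

```python
from statistics import pvariance

def controversial_books(ratings_by_book, min_ratings=3):
    """Rank books by how much similar readers disagree about them.

    ratings_by_book maps a title to the list of star ratings given by
    readers judged similar to the user (how they are found is out of scope).
    """
    scored = [
        (pvariance(ratings), title)
        for title, ratings in ratings_by_book.items()
        if len(ratings) >= min_ratings  # skip books with too few ratings
    ]
    # Highest variance first: these are the books neighbours split on.
    return [title for _, title in sorted(scored, reverse=True)]

# Made-up ratings for illustration only.
neighbor_ratings = {
    "Book A": [5, 5, 4, 5],   # everyone agrees
    "Book B": [1, 5, 2, 5],   # love it or hate it
    "Book C": [3, 3],         # too few ratings, ignored
}
print(controversial_books(neighbor_ratings))
```

High-variance books are the ones where similar readers split, which may make them better discovery candidates than books everyone rated five stars.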
dbl000 2 hours ago
Echoing what everyone else has said here - awesome site, love how fast it was.
I did notice that when I put in a single book in a series (in my case Going Postal, Discworld #33) that tended to dominate the rest of the selection. That does make sense, but I don't want recommendations for a series I'm already well into.
Also noticed that a few books (Spycraft by Nadine Akkerman and Pete Langman, Tribalism is Dumb by Andrew Heaton) that I know are on Goodreads and reviewed didn't show up in the search. I tried both the authors' names and the titles of the books. Maybe they aren't in the dataset.
It did stumble with some more niche books (The Complete Yes Minister). Trying the "Similar" button gave me more books that were _technically_ similar because they were novelizations of British comedy shows, but not what I was looking for.
For more common books though it lined up very well with books already on my wishlist!
- costco an hour ago
  Yes I would say the handling of series is probably the biggest problem. Once my test metrics got to a point I was happy with and my quality spot checks passed (can I follow the model's recommendations from one generic history book to Steven Runciman, making sure popular books don't always dominate the results), I was ready to release because I had been working on this project for so long. The solution is probably using the transformer model to generate 100-200 candidates and then having a reranker on top.
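One way the "generate candidates, then rerank" step could look, as a rough sketch only. It assumes each candidate arrives as a `(title, series, model_score)` tuple and uses a simple hand-written rule (skip series the reader already entered, cap the rest at one book per series) in place of a learned reranker. All names and data here are hypothetical, not the site's actual code.

```python
from collections import Counter

def rerank(candidates, input_series, max_per_series=1):
    """Rule-based reranker over model candidates.

    candidates: iterable of (title, series_or_None, model_score).
    input_series: set of series names the reader already entered.
    """
    kept = []
    per_series = Counter()
    # Walk candidates from highest to lowest model score.
    for title, series, _score in sorted(candidates, key=lambda c: c[2], reverse=True):
        if series is not None:
            if series in input_series:
                continue  # reader is already into this series
            if per_series[series] >= max_per_series:
                continue  # this series already has its slot(s)
            per_series[series] += 1
        kept.append(title)
    return kept

# Placeholder candidates with made-up scores.
candidates = [
    ("Series X #34", "Series X", 0.9),
    ("Standalone novel", None, 0.8),
    ("Series Y #1", "Series Y", 0.7),
    ("Series Y #2", "Series Y", 0.6),
]
print(rerank(candidates, input_series={"Series X"}))
```

A learned reranker could replace the two `continue` rules with a scoring model, but the overall shape (broad candidate list in, filtered and reordered short list out) would stay the same.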
blehn an hour ago
You should filter out authors from the input books in the output. If I liked a book by an author, surely I'd read more of their work if I wanted to; recommending them isn't helpful. Along the same lines, I think interesting recommendations tend to be the ones that (1) I like and (2) I didn't expect. The more similar the recommendations are to the input, the more likely I already know them, and the more likely to create a recommendation echo chamber.
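The author filter described above could be a small post-processing step. This is a sketch under the assumption that both the input books and the recommendations are available as `(title, author)` pairs; the function name and the placeholder data are hypothetical.

```python
def drop_known_authors(recommendations, input_books):
    """Remove recommendations written by any author of an input book.

    Both arguments are lists of (title, author) pairs.
    """
    known = {author.casefold() for _, author in input_books}
    return [
        (title, author)
        for title, author in recommendations
        if author.casefold() not in known  # case-insensitive author match
    ]

# Placeholder data for illustration.
input_books = [("Input Book", "Author One")]
recommendations = [
    ("Another Book", "author one"),
    ("Different Book", "Author Two"),
]
print(drop_known_authors(recommendations, input_books))
```

Matching on a casefolded name is crude (co-authors and pen names would slip through), so a stable author ID would be a better key if the data has one.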
mscbuck 19 minutes ago
Awesome site and speed!
My advice from someone who has built recommendation systems: Now comes the hard part! It seems like a lot of the feedback here is that it's operating pretty heavily like a content-based system, which is fine. But this is where you can probably start evaluating on other metrics like serendipity, novelty, etc. One of the best things I did for recommender systems in production is having different ones for different purposes, then aggregating them together into a final ranking. Have a heavy content-based one to keep people in the rabbit hole. Have a heavy graph-based one to try and traverse and find new stuff. Have one that is heavily tuned on a specific metric for a specific purpose. Hell, throw in a pure TF-IDF/BM25/SPLADE-based one.
The real trick of rec systems is that people want to be recommended things differently. Having multiple systems that you can weigh differently per user is one way to achieve that; usually one algorithm can't quite do that effectively.
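A hedged sketch of what "multiple systems weighted per user" might look like. It blends the ranked lists from several recommenders with weighted reciprocal rank fusion, which is one common way to merge rankings and not necessarily what the commenter used. The recommender names, the weights, and `k=60` (a conventional smoothing constant) are assumptions for illustration.

```python
from collections import defaultdict

def blend(ranked_lists, weights, k=60):
    """Merge rankings from several recommenders into one list.

    ranked_lists: {recommender_name: [item, ...]} with best item first.
    weights: {recommender_name: weight}; could differ per user.
    """
    scores = defaultdict(float)
    for name, ranking in ranked_lists.items():
        weight = weights.get(name, 0.0)
        for rank, item in enumerate(ranking, start=1):
            # Reciprocal rank fusion: earlier ranks contribute more.
            scores[item] += weight / (k + rank)
    # Sort by blended score, breaking ties alphabetically for stability.
    return sorted(scores, key=lambda item: (-scores[item], item))

ranked_lists = {
    "content": ["a", "b", "c"],  # keeps people in the rabbit hole
    "graph": ["d", "b", "e"],    # tries to surface new stuff
}
# A user who wants more discovery gets a heavier graph weight.
print(blend(ranked_lists, {"content": 0.2, "graph": 1.0}))
```

Per-user weights could then be tuned from feedback, for example by raising the graph weight for readers who keep saying they have already read everything suggested.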
yoz-y 2 hours ago
It works pretty well in the sense that after inputting only a few quite diverse books it gave me recommendations for a lot of books that I’ve already also read and enjoyed.
I would also really like a possibility to add negative signal. It did also recommend books that seemed interesting to me but I ultimately didn’t like.
Overall quite impressive.
- garciasn 8 minutes ago
  About 90% of the books recommended from the 15 I put in I've read. This means it's great at recommending, but not at finding new stuff for me :(
varenc 2 hours ago
I love this site, and the approach! Great seeing someone making good use of Goodreads data.
Sadly my experience with the book recommender isn't too great because of the 64 book limit. If I import either the most recent or least recent 64 books, 95% of the books it recommends to me are books I've read. Though it was helpful for spotting a few books I've read that I didn't log on Goodreads. Guess I'm pretty consistent.
- costco 2 hours ago
  I think I will expand the input books limit (sadly requires retraining) and/or the output books limit of 30.
fennec-posix 16 minutes ago
Very neat. Even found a couple Cold War-setting books to read and an entire series of 6 books on the same topic, all from searching up Team Yankee.
Thanks for the new reading list :D
aj_hackman 3 hours ago
Thank you! Because of this, "The Making of Prince of Persia: Journals 1985–1993" by Jordan Mechner is on its way to my house.
- qingcharles 2 hours ago
  You definitely will not regret that purchase. It's a very enjoyable read.
mcbrit 2 hours ago
I don't know. I entered, trying to be popular but at least slightly? opinionated:
Tigana, Hyperion, A Fire Upon the Deep, Blindsight, Moby Dick
and I got a list. Sure, I'd read all of that or wasn't interested for reasons, so I added (only Neuromancer was in the initial recommendations):
Neuromancer, VALIS, Quantum Thief, Towing Jehovah.
List did not get more interesting.
Book recommendations are still kind of difficult.
- mcbrit 2 hours ago
  If I provide that list, a (real) person doesn't ask me if I've read the Hobbit.
- teaearlgraycold 2 hours ago
  I don’t think past liked books are nearly enough information to provide a good book for you today. You need a lot more information about the state of someone’s mind.
  mcbrit 2 hours ago
  You're talking to a dude. (in my case.) I mentioned 8 books.
  I won't tell you exactly what to do, but one way to do it is to measure your surprise with me choosing each of those 8 books when you provide a recommendation back to me of what I should read next. I think I get kind of that experience talking to someone about books.
  The algorithm didn't do that.
  teaearlgraycold an hour ago
  Talking to someone about books gives you so much more information than a book list. Their expressions, their accent, their energy level, their clothes, and many other things help to provide supplemental information.
MattGrommes 2 hours ago
This is cool but I'd love the option to filter out the author of the book you entered. I put in Shroud by Adrian Tchaikovsky and almost all the books are others by him, which is fine but doesn't really mix up the stuff I'm reading.
nickthesick 33 minutes ago
I have a web app https://bookhive.buzz which is a GoodReads alternative based on BlueSky’s protocol. I scrape all of the book data from Goodreads too.
I would love to be able to add a recommendation system based on this.
sodality2 an hour ago
This is fantastic!!! I've added many results to my want-to-read list, they're very on-point from very few inputs. It would be really cool to import from a user ID, where you can choose some subset of your read list to inspire new suggestions, while excluding all books in your want-to-read and already-read lists. But that's an ongoing scrape to maintain, it's a cat and mouse game you probably don't want to start. I wonder what the legal status of scraped training data is... if you don't reproduce any of the review data I presume you're fine?
- costco an hour ago
  You can import the first or last 64 books of your read, to-read, or currently-reading shelves if you press the "Import Goodreads" button and provide your Goodreads ID.
  sodality2 38 minutes ago
  D'oh, didn't even notice that button :P Wow, that greatly improved the recommendations, it even found a book I wouldn't say is particularly related to the others but I found it interesting-sounding. Thanks for such a cool site!!
walthamstow 2 hours ago
Works pretty well with cookbooks. Very cool work.
One suggestion would be to make the search less strict on diacritics. Searching for popular cook J. Kenji López Alt was only successful if I entered the correct O.
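Diacritic-insensitive matching can be sketched at the application level with Python's standard `unicodedata` module: decompose each string, drop the combining marks, and compare the folded forms. This only illustrates the idea; the site's search backend may have its own settings for this, and the helper names below are hypothetical.

```python
import unicodedata

def fold(text):
    """Lowercase text and strip diacritics (e.g. 'ó' becomes 'o')."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold()

def search(query, names):
    """Return names containing the query, ignoring case and diacritics."""
    needle = fold(query)
    return [name for name in names if needle in fold(name)]

authors = ["J. Kenji López Alt", "Someone Else"]
print(search("kenji lopez", authors))
```

Folding both the query and the indexed text means either spelling finds the author.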
xkbarkar an hour ago
Have nothing to add that hasn’t already been commented. Like the entries in the add list stay. Other than that, my recommendation list keeps coming up with books I have already read and loved and I am hitting the limit :(.
So filtering would be great.
I have seen a few versions of the same books listed more than once.
Loved this. Hope you get to tune it a little.
Also, thank you for not ruining the site with a single popup, email subscription list offer, chatbot, wheelspin from hell anywhere.
Blessings from the popup hating part of the interwebs.
androng an hour ago
I tried to import my book list with "Import goodreads" button and inputting https://www.goodreads.com/user/show/68515148-andrew but it said "import failed, see console"
- costco an hour ago
  Worked for me, could be due to server being overwhelmed
  Here is the URL with your books: https://book.sv/#52752877,46049530,18437030,52480873,3260654...
NitpickLawyer 2 hours ago
Interesting. I tested it with sci-fi, and it definitely recommends good books, but not sure how accurate it is at surfacing the sub genres / themes. For example for [aurora -ksr, seveneves, project hail mary, ender's game] it gave me dune. Which is a great book, but not in the "first-ish contact" style I hoped it would be.
Another thing I noticed is that it tends to recommend 2nd and 3rd books in a series, which is a bit so-so. If I add the first book in a series, I probably already read the whole series...
- 28304283409234 2 hours ago
  Came here to say this (recommending book 2 and 3 in a trilogy). Great app otherwise!
jamesponddotco 3 hours ago
The recommendations are pretty good; even though I only input six books, it was enough for it to recommend books I have on my wish list. Definitely going to play around some more. Plus, the website is super fast, very impressive.
Any chance we could get an API going at some point? Are you planning to open source the work?
I'm interested in the scraping of Goodreads too. I'm building a book metadata aggregation API and plan on building a scraper for Goodreads, but I imagine using a data center IP address will be a problem very fast. Were you scraping from your home network?
- costco 2 hours ago
  Thank you for the compliments :) I used 50-100 datacenter proxies. I just logged requests made by the iOS app with Charles and then recreated the headers to the best of my ability though the server did not seem to be very strict at all. Worth noting though that static residential proxies are not too expensive these days anyways.
  Re the API: The model does actually run fairly well on CPU so it probably wouldn't be too expensive to serve. I guess if there is demand for it I could do it. I think most social book sites would probably like to own their recommendation system though.
  goatsi 2 hours ago
  Speaking of sustained scraping for AI services, I found a strange file on your site: https://book.sv/robots.txt. Would you be able to explain the intent behind it?
  costco an hour ago
  I didn't want an agent to get stuck on an infinite loop invoking endpoints that cost GPU resources. Those fears are probably unfounded, so if people really cared I could remove those. /similar is blocked by default because I don't want 500000 "similar books for" pages to pollute the search results for my website but I do not mind if people scrape those pages.
  dbl000 2 hours ago
  I would love an API or the dataset if you could share it somehow! Just to play around with my own book lists.
stevage 41 minutes ago
This is great. Would be really nice to be able to reject suggestions though.
nsypteras 2 hours ago
I'm impressed it recommended so many books I've already read and liked! I have a big reading backlog but once it's whittled down I will likely come back to this. One feature request would be to also show a "why this is recommended" for each recommendation so I can further narrow down the list for what I'm looking for.
qingcharles 2 hours ago
I put in a bunch of books and hit recommendations and... I'd already read 95% of them, so at least we know it works well! (checking out the other 5% now)
p.s. one idea: when you click [Add] on the recommended books list, it should remove it from that list
p.p.s. if there is a way to filter out the spam "Summary of ____" books, that would be good too
- jacquesm an hour ago
  I have a hard time remembering titles of books I've read if they are not directly related to the subject matter. No problem remembering the content though. With movies I remember both.
jimmoores 2 hours ago
I unexpectedly liked this. I thought the recommendations were actually useful.
- parkersweb an hour ago
  I sadly didn’t share that experience - I fed it my goodreads most recent - but it largely picked up on 2 or 3 series I’ve been slowly working my way through so that most of the recommendation list was ALL the other books in the series (and the spin-off series) so I didn’t really get anything useful…
nwhnwh 2 hours ago
I entered "Alone Together: Why We Expect More from Technology and Less from Each Other" and I received books about Steve Jobs, Harry Potter and "The Subtle Art of Not Giving a F*ck". Like how???
- costco 2 hours ago
  If you want recommendations solely based on one book, please try the similar page: https://book.sv/similar?id=13566692
  These seem to fit the description you are going for better. The model is trained to predict the next book in the sequence. Those other books you listed happen to be very popular, so in the absence of information about you (only having 1 book), the model will tend to recommend those.
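  One possible mitigation for the popularity effect described above, sketched under assumptions: subtract a multiple of log-popularity from each model score before ranking, so very widely read titles need a stronger signal to reach the top. The function name, `alpha`, and the numbers are made up for illustration and are not how the site works.

```python
import math

def discount_popularity(scores, read_counts, alpha=0.5):
    """Rank books by model score minus a log-popularity penalty.

    scores: {book: raw model score (e.g. a logit)}.
    read_counts: {book: number of readers in the dataset}.
    alpha: penalty strength; 0 leaves the model ranking unchanged.
    """
    adjusted = {
        book: score - alpha * math.log(read_counts[book])
        for book, score in scores.items()
    }
    return sorted(adjusted, key=adjusted.get, reverse=True)

# Made-up scores and counts.
scores = {"blockbuster": 5.0, "niche": 3.5}
read_counts = {"blockbuster": 1_000_000, "niche": 1_000}
print(discount_popularity(scores, read_counts, alpha=0.0))
print(discount_popularity(scores, read_counts, alpha=0.5))
```

  With `alpha=0.0` the blockbuster stays first; with `alpha=0.5` the niche title overtakes it, so `alpha` acts as a dial between safe and surprising results.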
- BeetleB 2 hours ago
  > Provide 3+ books for best results.
skayvr 2 hours ago
I've worked in recommender systems for a while, and it's great to see them publicized.
SASRec was released in 2018, just after the transformer paper, and uses the same attention mechanism but different losses than LLMs. Any plans to upgrade to other item/user prediction models?
- costco 2 hours ago
  I'm not an expert by any means but as far as sequential recommendations go, aren't SASRec and its derivatives pretty much the name of the game? I probably should have looked into HSTUs more. Also this / sparse transformers in general: https://arxiv.org/pdf/2212.04120
  skayvr 2 hours ago
  There's a few alternatives, but SASRec is a good baseline for next-item recommendation. I'd look at BERT4Rec too. HSTU is definitely a strong step forward, but stays in the domain of ID models. HSTU also seems to rely heavily on some extra item information that SASRec does not (timestamps).
  Other models include Google's TIGER model which uses a VAE to encode more information about items. Similar to how modern text-to-voice operates.
  costco an hour ago
  Thank you for the recommendations. I didn't try BERT4Rec because I assumed it would perform the same as or worse than what I already had after having read https://dl.acm.org/doi/pdf/10.1145/3699521. The TIGER paper seems interesting - I definitely want to explore semantic IDs in general and also because I think it could allow including more long-tail items.
  bigskydog 2 hours ago
  Recommend OneRec which is an improvement of HSTU and it recently became open source
_virtu an hour ago
Hey OP I’m building a bookclub app. Do you happen to have an api I could plug into? I’d love to add this to our member suggestions section.
noir_lord 3 hours ago
It has a tendency to recommend books in the same series as are input (putting aside that if I like a book in a series I've likely already read the series).
It did suggest Murderbot Diaries (not on the input but a series I have read and did like) and an Adrian Tchaikovsky I hadn't read :).
- costco 2 hours ago
  It's explicitly trained to predict the next book read in a sequence, which is why you get that behavior. There's probably a better way for me to handle it rather than having 5 books from the same series tend towards the top though.
  noir_lord an hour ago
  If you have the data to know the other books in a series maybe split the results so you have "books in series" in one column and "books not in a series mentioned" in the other but other than that it did a better job than Kindle recommendations which are often hilariously off the mark.
- bananaflag 2 hours ago
  Yeah the hardest problem for recommendation systems is to find non-Star Wars books which are like some specific Star Wars books and unlike some other Star Wars books. I would say it's AGI-complete ;)
  noir_lord an hour ago
  Ironically that is one of the few uses where I've found an LLM to actually be useful.
  ChatGPT does a fairly good job at letting you negate/refine whatever it was you were looking for.
__alexander 2 hours ago
Care to share the scraped data? I would love to play around with it.
- costco 2 hours ago
  Not sure if I can. At the very least book descriptions most likely could not be distributed. There is an academic dataset with around 200M reviews though: https://cseweb.ucsd.edu/~jmcauley/datasets/goodreads.html
- demaga 2 hours ago
  I am not sure about the legal side of things here, but a Kaggle dataset would be really cool
- guelo 2 hours ago
  I'm surprised he got that much data. Goodreads uses several tricks to try to stop scrapers, for example pagination only works up to a few pages.
  jacquesm an hour ago
  They might send him a bill for use of resources.
comrade1234 3 hours ago
I gave up on Goodreads reviews. I've been burned too many times by highly rated books that weren't that good. If you're into (horny) YA romance fantasy then Goodreads is great, but it's not for me. I haven't really found a substitute.
- owenversteeg 2 hours ago
  Any broadly used ratings system is total garbage. Goodreads ratings, Google Maps ratings, Amazon reviews, Vivino for wine, et cetera. Even assuming the reviews are real and genuine, most people just aren’t good at writing reviews, and the handful that are often have wildly different criteria than you. Someone already commented with one enthusiast site - and sure, enthusiast sites are often better than the mainstream option (see also: CellarTracker for wine) but honestly my advice is to get good at determining the quality of the thing yourself. For books there are a ton of hints about what you’ll be getting. “NYT Bestseller”, “xyz book club”, certain publishers, who’s quoted on the back, when was it published, who wrote it? All of those things can help you rapidly identify books. I personally dislike most modern books and prefer the “classics”, so a lot of this is only useful as a negative signal, but even then there are positive signals, for example a reference to a much older book.
- HeinzStuckeIt 2 hours ago
  GR is also great if you are into academic nonfiction, Classics, poetry, etc. The site does, after all, let you track and review any publication with an ISBN. What my peers and I use it for is worlds apart from the romance novel or LGBT young-adult book reviewing community that often puts GR in the news, and far away from all the drama that rages around genre fiction.
- jamesponddotco 2 hours ago
  I'm not into the social aspect, so Goodreads was never an option, but Hardcover[1] seems like a pretty good alternative.
  [1]: https://hardcover.app
esafak 3 hours ago
It is interesting that you chose a contextual recommender when you would think book affinity is not very susceptible to context. Did you try other models too?
thinkcontext a day ago
I'm impressed! It didn't take many books for it to start suggesting other books that I liked and it showed me several solid choices I'm adding to my queue.
djoldman 2 hours ago
Can you share the details about the Meilisearch instance? How big is the box and database size?
- costco 2 hours ago
  Everything (namely Meilisearch, Postgres and the web server in Go) besides the model inference is running on a Hetzner server with a large SSD and an "AMD Ryzen 7 3700X 8-Core Processor." The data.ms directory is about 40GB. Once the HN traffic dies down I will probably move the model back to the Hetzner server so I don't have to pay $0.15/hour for an A4000.
tristor 31 minutes ago
Two bugs to know about. First, you are using a deprecated API call that fails in Firefox. Second, you are using an HTTP endpoint that fails to upgrade to HTTPS to call the GoodReads API, which also fails with HTTPS-Only enabled in both Chrome and Firefox.
The idea seems good, but since I can't import my GoodReads successfully, it's hard for me to try
momocowcow 2 hours ago
Whatever I put in, it wants me to read Sapiens :_(
- oever 43 minutes ago
  Can confirm. Stallman, Torvalds, Orwell, Harari
  https://book.sv/#2300585,644416
jauntywundrkind 2 hours ago
Where do nice scrapes like this end up? Are there BitTorrents out there for scrapes like this?
Honestly this would finally be the web2.0 we all wanted & hoped for. It's against majesty that it's all captured owned user content that is legally captured by essentially public message boards/sites.
skerit 3 hours ago
Please make this for tv series too!
submeta 2 hours ago
Like the idea! Wondering: Weren’t the early LLMs trained on data in Goodreads as well? I can upload and ask ChatGPT as well, and it will give me similar recommendations, no?