Building revyu.ai: production lessons from a hotel-review RAG system
By Levon Travajyan, digibricks GmbH
I'm picky about hotels. Cleanliness, breakfast, beach quality if it's a beach trip, wifi reliability if I have to work, location, room size, whether the spa is worth the price. Before any trip I'd lose hours moving between Booking.com, TripAdvisor, check24, holidaycheck, opening review after review trying to find one specific answer to one specific question.
The first version of revyu.ai came directly from that. A Chrome extension for Booking.com that lets you ask a question on any hotel's page and get an answer grounded in the real reviews and the hotel's official data.
Early on, I was staying at a hotel where the wifi was unreliable. I ran the prototype on the hotel's Booking page and asked it about the wifi. It came back with a specific answer: wifi worked poorly in certain rooms, and one of them was room 215. My room was 217. The system had pulled that detail out of one review buried among hundreds, without me telling it to look for room numbers. That moment is when I knew the architecture was doing something real.
This piece is the story of what I built around that moment, the engineering decisions that shaped the product, what didn't work, and what I went back and measured after pausing active development.
The first version
revyu.ai launched as a Chrome extension for Booking.com. You open a hotel on Booking.com, a side panel appears, and you can do two things. You can chat with the hotel through a question-answer interface, and the system answers from real guest reviews plus official hotel data. Or you can switch to an analytics tab that shows a breakdown of how the hotel is rated on the dimensions that actually matter (cleanliness, location, value, service, amenities) along with a small chart of the last 12 months of guest ratings. The idea behind the analytics view was that you'd start there to get a quick read on the hotel, then ask follow-up questions about anything that looked off.
I built the first prototype solo at the start of 2025 as a side project. By spring it felt usable. We partnered up for launch: a co-founder on product design, a second engineer on the production frontend, me on the backend. We launched on Product Hunt in August 2025 and reached around 1,000 browser extension installs in the following months. Real users sending real queries.
The whole thing ran on AWS, with Pinecone for vectors, Redis for caching, and DynamoDB for durable storage. Infrastructure as code via Terraform, in eu-west-1.
The cold-start problem
The interesting engineering started when real users arrived. What follows are the problems that shaped the system.
The first one was speed. When a user opens a hotel page, the extension needs to either return immediately (if we already have that hotel's data) or kick off a fetch-and-process pipeline (if we don't). The pipeline for a new hotel was: fetch reviews from the Booking.com API via RapidAPI, embed each one, store the embeddings somewhere we could query fast, store the raw reviews somewhere we could pull full text from. End to end this took about 20 seconds for a hotel we'd never seen before.
20 seconds is fine for the rare case. Unacceptable as a default experience.
I built a pre-warming pipeline. I'd periodically identify the most-booked hotels per major destination and pre-process them in batches during off-peak hours. When a user opens a popular Berlin hotel, the answer comes back in 6 to 10 seconds because the embeddings and reviews are already in our system. When they open something rare, it falls through to the cold path. The cap on what we kept was 1,000 reviews per hotel or 12 months of history, whichever came first. Older reviews matter less for current quality questions and the cap kept embedding costs bounded.
The lesson here was a category of problem I'd revisit later in client work: AI features have a fundamentally different latency profile for "cold" vs "warm" users, and the engineering effort goes into making the warm path feel instant while keeping the cold path tolerable. Pre-computing what you can, deciding what's worth pre-computing, and instrumenting which path each request actually took.
The vector store, and the €1,000 a month problem
The first attempt at vector storage was AWS ElastiCache Redis. We already had Redis in the stack for caching hotel data and conversation state, so adding vectors there felt obvious. It also turned out to be the wrong choice for a fundamental reason: AWS ElastiCache Redis didn't support vector search at the time. Once I hit that, I moved to a self-hosted Redis on ECS using the Redis image with vector search support.
Self-hosted Redis worked, but the bill was around €1,000 a month and growing within a few weeks. We were holding embeddings for roughly 15,000 to 20,000 hotels across all our pre-warmed locations, which came out to around 20 million review-level vectors in storage at peak. The cluster needed enough memory to hold them all. Most of that cost was paying for memory I wasn't actively querying. This was mostly pre-warmed inventory rather than continuously queried hot data, which is exactly why memory-heavy Redis was the wrong economic model for our access pattern.
I migrated to Pinecone serverless. Base plan around €50 a month plus usage. Same retrieval interface, same query patterns, about 90 percent cost reduction. The migration itself was straightforward: write a new vector store implementation behind the existing interface, run them in parallel briefly, switch the feature flag.
The lesson: the obvious storage choice for the rest of your stack is rarely the right choice for vector data once you're past a few hundred thousand vectors. Vector stores have different access patterns, different scaling economics, and different operational concerns from your application database. I now check vector storage costs first in any audit of a production AI system.
The disambiguation surprise
A few months in, OTAs (online travel agencies) reached out about embedding revyu.ai as a widget on their own pages. Same product surface, different host. The engineering implication I underestimated: when you embed a widget into someone else's site, you don't get their canonical hotel IDs. You get whatever string they happen to pass, usually a hotel name. Sometimes a name and a city. Sometimes in the user's native language.
Take "Park Hyatt Berlin." That might arrive as Park Hyatt Berlin, Park Hyatt Berlin Mitte, Park Hyatt Hotel Berlin, Hyatt Park Berlin, Парк Хаятт Берлин (in Russian), or one of several other variants from OTA catalogs that don't quite match Booking.com's exact name. Multiply that across every chain and every small independent hotel that changed names when it switched brands, and we needed a resolver that could land on the right Booking ID from messy input.
I built a multi-stage resolution chain. The cheap first stages are DynamoDB lookups against an alias table I'd been building since launch, keyed by normalized name and city. Then progressively looser matching against Booking's catalog via RapidAPI, with strict and lenient passes at each level. The last two stages are LLM-based: ask gpt-4.1-mini to normalize the messy name into something Booking will recognize, and as a final fallback, a web-search-grounded normalization that asks the model to find the right hotel by searching the public site. Every successful resolution writes back to the alias table so the next request hits the cache, and each stage has a one-attempt budget so the chain can't loop. In production, most queries resolve in the first one or two stages once the cache warms up. The expensive fallbacks fire on cold cases and we eat the latency once.
Standard engineering, but worth thinking about before signing a B2B integration deal: catalog-shape mismatches are the kind of long-tail problem that doesn't exist in single-source consumer products.
The follow-up question problem
A few weeks after launch I noticed a pattern in user sessions. The system answered each question correctly in isolation, but follow-up questions that depended on prior context fell flat. The hotel was always known (the extension reads the hotel ID from the page, the widget receives it from the host). What was missing was conversation memory. "How's the wifi?" got a good answer. "And the breakfast?" got an answer about breakfast for the same hotel, but a generic one, because the system had no idea that "and the breakfast?" was a follow-up to a previous turn that might have established context like "for a family trip" or "during summer." Each query was being answered as if it were the first.
Building real conversational state is harder than it looks. The naive approach is to replay the entire chat history with each new query, which works for short conversations but burns tokens fast and breaks down once you cross the model's context window. What I built instead is a hybrid: a sliding window of the most recent six turn pairs cached in Redis with a 24-hour TTL, plus a rolling LLM-generated summary capped at 120 words held in DynamoDB for durability.
Each new query goes through a rewrite step where gpt-4.1-mini takes the recent turns, the rolling summary, and the new question, and produces a standalone reformulation. "And the breakfast?" becomes something like "For Hotel X in Berlin, how good is the breakfast?" The reformulated query is what we actually embed and search with.
This works, but it costs three LLM calls per turn (the rewrite, the main analysis, the summary update) instead of one. Latency went up by about two seconds. It's the single biggest cost driver in the chat path, and one of the first things I'd revisit if I were running this commercially.
The recommendations feature
A few months in we expanded beyond single-hotel Q&A. Users wanted to start from a sentence and end at a bookable hotel. "Quiet hotel near the beach in Crete for a family with two kids in August, budget around €200 a night."
I built a multi-stage pipeline for this. First, an LLM call to parse the free-text query into structured intent (location, dates, traveler composition, budget, required facilities, preferences). Then a Booking.com search with whatever filters their API supports directly. Then a second filtering pass using our own data, the review embeddings and the aspect-level data we'd already pre-computed, to apply preferences that aren't directly available as Booking filters. "Quiet" isn't a Booking filter. "Good for families" isn't a Booking filter. But you can score every candidate hotel against the aspect-level data we'd stored, rank them, and surface the top matches.
The end result is a list of hotels with links that drop the user straight into the Booking page with their preferences pre-selected. They go from a sentence to a bookable hotel in one or two clicks.
This is also where the aspect categorization comes in. When we ingest reviews for a hotel, we generate per-aspect summaries and scores across about ten to fifteen aspects (cleanliness, location, value, service, amenities, room size, food quality, noise level, family-friendliness, business amenities, and others). When a recommendation query needs to filter on "cleanliness above some threshold" or "quiet above some threshold," we score every candidate hotel against the relevant aspect instead of running a vector search per hotel. It's coarse compared to semantic search, but for the multi-stage recommendation pipeline it's fast enough to be useful as a filter.
End-to-end recommendation latency was the worst part of the experience. The intent parsing is one LLM call. The candidate fetch is one RapidAPI call. The aspect filtering is fast. But generating per-hotel explanations for the final ranked list ("Why is this hotel a good match for your query?") is one LLM call per hotel. For ten results that's eleven LLM calls per recommendation query. We tuned concurrency to make it feel acceptable, but it's expensive at scale.
This is the point where I started seriously thinking about specialized small models.
What I was thinking about and didn't ship
The economics of running general-purpose frontier LLMs against narrow domain problems get bad fast. The system used gpt-4.1-mini throughout (query rewriting, analysis, recommendation intent parsing, per-result explanations). Each call is fast and cheap in isolation. Multiply by ten or eleven calls per recommendation query, by a thousand users, by a steady growth curve, and you have a unit economics problem.
The direction I was sketching but didn't ship was a small fine-tuned model specifically for hotel review reasoning. Take a 3 to 7 billion parameter open model (something in the Qwen or Mistral family), fine-tune it on a corpus of hotel-domain queries and answers (which we had the data to construct from real production traffic), and use it for the high-volume routine tasks: aspect labeling, source attribution, recommendation explanations, simple factual Q&A. Keep gpt-4.1-mini for the genuinely hard cases (multi-aspect reasoning, ambiguous questions, edge cases). The architecture would be a routing layer that decides which model to use based on query complexity, with the small model as the default and the larger model as the fallback.
This would have done two things. It would have reduced per-query cost by an order of magnitude on the routine paths because we'd be running the small model on our own GPU instance instead of paying per-token to OpenAI. And it would have cut latency on those paths because we'd be running locally rather than calling an external API.
This is the work I'd start with if I were rebuilding revyu.ai for a serious commercial run today. My hypothesis is that a large share of routine hotel-review tasks (aspect labeling, source attribution, recommendation explanations, simple factual Q&A) could move to a domain-specific small model, with a frontier model kept as fallback for harder cases like multi-aspect reasoning or ambiguous questions. I haven't shipped this on revyu.ai, so I'd treat it as the first experiment to run rather than a guaranteed optimization. But the pattern itself, replacing the right slice of LLM calls with a fine-tuned small model, is one of the directions I would evaluate first for teams running high-volume LLM features in 2026 rather than scaling their OpenAI bills indefinitely.
What didn't work as a business
By December the unit economics of the B2B widget deals were clear. Each OTA contract was modest. The path to scale was either many small contracts (which neither co-founder was equipped to chase at the volume needed) or a few large enterprise contracts (which our network and timing didn't support).
The product worked. Real users used it. Real businesses wanted to embed it. The go-to-market shape just didn't fit our circumstances.
We made the call to pause active development. The extension is still live for existing users. The infrastructure is intact. The codebase is current and runnable. We'd build something on top of revyu.ai again if the right business shape appeared.
What I took from this part is something I now think about for every client engagement: a working AI product and a viable AI business are different things. The engineering can be excellent and the unit economics can still kill you, and vice versa. The conversations I now have with potential clients usually start with the business question, not the engineering one.
What I went back and measured
After the initial version shipped and ran against real user traffic, I had a specific question I kept coming back to: how much retrieval quality was the original design leaving on the table?
The original retrieval embedded each review as a single 1536-dimensional vector. A typical review reads like "breakfast was great but the wifi was terrible and the staff was friendly." That entire mix becomes one vector. When a user asks "is the wifi good?", the vector for that review sits somewhere in the middle of the embedding space because most of its content is about other things. The signal is diluted.
I knew this was a weakness when I shipped. Whole-review embeddings worked well enough for the user base we had. Working well enough is not the same as working well.
I built a v2 retrieval pipeline alongside the original, kept everything else identical (same embedding model, same downstream LLM, same hotel filter), and benchmarked both on a held-out evaluation set of 26 questions across 13 hotels in five categories (specific amenity queries, vague quality queries, multi-aspect queries, contrastive queries, and edge cases including non-English queries). For each question, the eval recorded which review chunks the system returned and used two scoring approaches: standard retrieval metrics (was the right content in the top results) and an LLM-as-judge that read the question plus the retrieved chunks and rated whether the chunks could plausibly answer the question.
v2 made three changes. First, chunking at the sentence level instead of the whole review. A typical review contains four to eight sentences about different things; breaking it into sentences means each retrievable unit is one specific claim, not a mix. Second, hybrid retrieval that runs both a semantic search (vector similarity) and a keyword search (BM25, a classic lexical ranking method) in parallel and merges the two ranked lists using a fusion algorithm (Reciprocal Rank Fusion, RRF). Third, a second-stage reranker (Cohere rerank-v3.5) that takes the top fused candidates and rescores them with a model trained specifically for relevance ranking.
The results:
| Metric | v1 | v2 | Change |
|---|---|---|---|
| Recall@5 (lenient) | 0.202 | 0.621 | +0.420 |
| Recall@10 (lenient) | 0.291 | 0.731 | +0.441 |
| Recall@5 (strict) | 0.000 | 0.530 | +0.530 |
| MRR | 0.593 | 0.923 | +0.330 |
Two findings from the experiment matter as much as the headline numbers.
BM25 contributed approximately zero measurable lift. When I ran v2 with and without the BM25 leg, the metrics were essentially identical. The Cohere reranker did all the work the lexical signal would have done. I shipped sentence-level chunking with dense plus Cohere rerank as the production retrieval path, and dropped the BM25 stage. About 150 lines of code and one dependency that didn't earn their keep on this corpus. I kept the broken and working BM25 runs in the experiment results because "we tried hybrid search and it didn't help" is more useful information than "we used hybrid search."
The LLM-judge metric regressed by 0.2 points, and that regression was measurement, not quality. v1's whole-review chunks contain four to eight sentences each, so even when only one sentence is relevant, the judge sees plenty of surrounding context and rates the chunk "answerable." v2's single-sentence chunks have less material per chunk and the judge sometimes downscored them despite returning the actually-relevant content. On one question, v1 retrieved none of the three expected chunks but got a 5/5 from the judge because the wrong chunks happened to mention similar topics in passing. v2 retrieved all three expected chunks and got a 4. The retrieval metrics (Recall, MRR) are the load-bearing numbers in the table above, and the takeaway for me is to never trust a single-LLM judge on sentence retrieval without human spot-checks.
v2 adds about 200 to 600 ms at the retrieval step, mostly from the Cohere rerank call. End-to-end latency is around 900 to 1400 ms in production. Acceptable for a user-facing widget that already runs an LLM analysis step downstream.
Cost-wise, the production v2 adds roughly $0.002 per query (Cohere Rerank pricing), or about $2,000 a month at one million queries. Acceptable in exchange for the retrieval quality jump.
v2 is the active retrieval path for new queries on revyu.ai today. Existing hotels are being migrated to the v2 sentence index gradually rather than in one big backfill, with the v1 retrieval handling fallback during the transition. Both paths share the same downstream answer generation and source labeling, so user-facing behavior is consistent.
What I'd do differently in 2026
Beyond the retrieval rebuild, several other choices I shipped would be different now:
Consolidate the chat-turn LLM calls. Three calls per turn (rewrite, main analysis, summary update) was always overkill. Modern retrieval handles follow-ups reasonably well with the last few turns inlined directly into the main prompt, removing the rewrite step entirely.
Single-query metadata filters instead of N-query fan-out. The cross-hotel recommendation search does up to thirty independent Pinecone queries with concurrency. Pinecone supports hotelId: { $in: [...] } metadata filters that collapse the whole fan-out into one call. This was a latency footgun the whole time.
Structured outputs instead of regex-extracted JSON. The analysis service uses regex to extract JSON from LLM completions. OpenAI now supports response_format: { json_schema: ... } which makes this enforceable at the API level and removes a whole class of parsing failures.
Evaluation infrastructure from day one. I built revyu.ai with informal testing: asking it questions about hotels I'd stayed at and grading the answers myself. No recall@k, no MRR, no offline eval set. For client work in 2026 the eval pipeline is the thing I build first, before the retrieval pipeline, because you can't improve what you can't measure.
What this means for the work I do now
revyu.ai gave me end-to-end production experience with the parts of building AI features that demos skip. Cost-aware vector storage choices. Tenant-aware widget infrastructure. Genuinely messy real-world problems like name disambiguation across multilingual catalogs. The full range of failure modes that only appear when real users send real queries against a system that has to actually work.
It also gave me an honest picture of what I'd built and where the limits were. Going back later and measuring those limits, then publishing what changed, is the discipline I now bring to client work through digibricks. Most production AI features I see in 2026 have a similar quality ceiling sitting in their retrieval or generation layer, and the engineering team that shipped it usually knows roughly where the ceiling is. Finding it, measuring it, and lifting it is the work.
If your AI feature is already live, but your team can't clearly answer questions like what does it cost per query, how accurate retrieval is on the queries that actually matter, and where it fails, that's exactly what I look at in an AI Feature Audit.
External link: https://cal.com/digibricks