Building Search at Airbnb

The Presentation inside:

Slide 0

Building [email protected] Mousom Dhar Gupta

Slide 1

Total Guests Cities 20,000,000+ 34,000+ Countries Castles 190 600+ Listings Worldwide 1,200,000+

Slide 2


Slide 3

That Awesome Slide Title of Yours

Slide 4

Technical Stack ____________________________ DropWizard as a service framework (incl. Jetty, Jersey, Jackson) ZooKeeper (via Smartstack) for service discovery. Lucene for index storage and simple retrieval. In-house built forward index, real-time indexing, ranking, advanced filtering.

Slide 5

Overview Search Web App Search1 150 Search Threads JVM Lucene Index Search2 … SearchN ~30 replicas of same index data

Slide 6

search Lucene ____________________________ Lucene Each box has 8 shards of Lucene Index Latency is 50% less than a single shard index Lucene Lucene Lucene Lucene Lucene Combiner Shards Lucene Filtering and Ranking

Slide 7

Indexing Challenges ____________________________ Bootstrap (creating the index from scratch) Ensuring consistency of the index with ground truth data in real time

Slide 8

What’s in the Lucene index? ____________________________ Positions of listings indexed using Lucene’s spatial module (RecursivePrefixTreeStrategy) Categorical and numerical properties like room type and maximum occupancy Full text (descriptions, reviews, etc.) ~40 fields per listing from a variety of data sources, all updated in real time

Slide 9

Realtime Update master Search 1 Search 2 calendar SpinalTap Medusa … Search N fraud DataStore

Slide 10

Tails binary update logs from Mysql Servers (5.6+) Spinaltap Converts changes in any of the tables into actionable objects called “Mutations” (Inserts, deletes, Updates) Broadcasts them to Medusa using Kafka

Slide 11

Realtime Update master calenda r Search 1 Search 2 SpinalTap Medusa … Search N fraud DataStore

Slide 12

Source of truth for search index data. Medusa Listens to updates from Spinaltap and builds new IndexData by querying ~15 mysql tables from three different databases. Persists everything in a search nodes. DataStore and broadcasts latest version to all Uses ZooKeeper for leader election.

Slide 13

Realtime Update master calenda r Search 1 Search 2 SpinalTap Medusa … Search N fraud DataStore

Slide 14

Forward Index public final class ForwardIndexData private final private final We also have complicated business rules to calculate Price, Availability, InstantBook etc which needs a ton of metadata. ~50 fields built from multiple data source and updated in realtime. HostInfo calendarData; pricingData; hostInfo; . . . . ____________________________ Holds all the metadata about a listing required by scoring and filtering. PricingData private final What’s in the forward index? CalendarData { . . . . } ! public final class CalendarData private final DateRanges private final SeasonalValues { reservationDates; startDayOfWeeks; . . . . } ! private final class private final SeasonalValues<T> DateRange private final T value; . . . . } startDate; {

Slide 15

Availability ____________________________ ! Depends on the profile of guest. The checkin date must be one of the valid start days of the week. Must satisfy seasonal minimum nights. There must be enough preparation time for the host. Import busy dates from external calendars to avoid booking conflict.

Slide 16

Pricing ____________________________ ! Depends on number of guests , number of nights. How close or further away the checkin date is. How long is the trip, does the host have Weekly and Monthly pricing. Is there special price override for these nights.

Slide 17

Instant Book ____________________________ ! Depends on number of guests , number of nights. Profile of the guest like positive reviews, does have profile photo? How much preparation time the host has etc.

Slide 18

Why did we need our custom Forward Index? Needs to store objects with 50-100 fields as values keyed by listing id. Should avoid the cost of serialization/deserialization during every fetch. Requirements Data must be available in-memory for fast lookup, but also persisted on disk. Highly Concurrent, writer shouldn’t block the readers (One writer but >100 reader threads)

Slide 19

Forward Index Interface // Forward Index // Writer public interface ForwardIndex<V> { forwardIndex.put(listingId, listingData); ! . . . Map<Long, V> asMap(); // write to disk and also make it visible to readers. forwardIndex.commit(); void put(long id, V value); ! void putAll(Map<Long, V> values); ! void remove(long id); ! void commit(); ! } // Reader // Fetch forward index data from in-memory map Map<Long, ListingData> fwdIndex = forwardIndex.asMap(); ListingData data = fwdIndex.get(listingId); ! // Use it to evaluate business rules checkAvailability(data, searchRequest); calculatePrice(data, searchRequest)

Slide 20

Forward Index Implementation // Forward Index public class ForwardIndexStore<V> implements ForwardIndex<V> { private final DB<V> diskStore; private final Cache<V> cache; ! . . . . ! @Override Map<Long, V> asMap() { NonBlocking In-Memory HashMap return Collections.unmodifiableMap(cache); } void put(long id, V value) { diskStore.put(id, value); cache.put(id, value); } DiskStore ! . . . . ! void commit() { diskStore.commit(); cache.commit(); } }

Slide 21

Ranking Ranking Problem ____________________________ Not a text search problem Users are almost never searching for a specific item, rather they’re looking to “Discover” The most common component of a query is location Highly personalized – the user is a part of the query Optimizing for conversion (Search -> Inquiry -> Booking) Evolution through continuous experimentation

Slide 22

Ranking Ranking Components ____________________________ Relevance Quality Bookability Personalization Desirability of location etc.

Slide 23

DB snapshots Several hundred signals used to build machine learning models: ! Properties of the listing (reviews, location, etc.) Behavioral signals (mined from request logs) Image quality and click ability (computer vision) Host behavior (response time/rate, cancellations, etc.) Host preferences model Logs

Slide 24

Life of a Query Geocoding Query Understanding Configuring retrieval options Choosing ranking models Retrieval Populator Filtering by Price and Availability 2000 results First Pass Scorer 25 results Quality Bookability Relevance Second Pass Ranking 25 results Result Generation AirEvents

Slide 25

Second Pass Ranking ____________________________ Traditional ranking works like this: ! then sort by In contrast, second pass operates on the entire list at once: ! Makes it possible to implement features like result diversity, etc.

Slide 26

Life of a Query Geocoding Query Understanding Configuring retrieval options Choosing ranking models Retrieval Populator 2000 results Filtering by Price and Availability Quality First Pass Scorer 25 results Bookability Relevance Second Pass Ranking 25 results Result Generation AirEvents

Slide 27

Slide 28