We’ve had a fun week over here at Topsy. We finished the roll-out of version 2.0 of our platform, a major engineering and operational milestone for us.
Topsy is now the largest searchable index of content posted on Twitter – we recently indexed our 5 billionth tweet and 2.5 billionth link. Unlike most retrieval systems, Topsy organizes its search index in real-time, while still maintaining a long-term history. Our v2 architecture takes our search approach to a new level of scale – it is designed to index over 100 billion status updates and related objects, from any social network.
Under the hood
For search and operations geeks out there, here’s a peek into how Topsy’s v2 index architecture works. Our platform spans a cluster of over 500 servers and roughly a petabyte of storage. We receive tweets from Twitter through the Firehose (and other sources like search.twitter.com and Twitter’s REST API). Each tweet is written to our distributed queuing store (called the Swarm). Swarm provides a mechanism for hundreds of tasks to process a status update within milliseconds of arrival, turning it into indexable data. A typical pipeline to process a tweet looks like this:
- Metadata for the tweet, including user information (such as language, location, user-ID, profile photo), is extracted.
- All links present in the tweet are extracted, short URLs are expanded and URL redirects resolved, and each link is visited to fetch titles and descriptions and to generate thumbnails for images.
- Relationships to other tweets or Twitter users are extracted, and a link is made and verified. (For example, Topsy tracks replies to tweets, and finds and verifies the original tweet for an organic retweet by computing a similarity score between the original and retweeted text.)
- A citation is created from the tweet, associating the user and the tweet text to links in the tweet (or to the original tweet if it’s a retweet).
- The text of the tweet is parsed and tokenized to make it searchable regardless of the language of the text. We do linguistic analysis that makes it possible to search Topsy in Japanese and Simplified/Traditional Chinese, among other languages.
- The tweet is loaded into the search index; Topsy has a unique citation index model, in which the text of a status update – treated as a citation – remains associated with its author and with the links it contains. This allows the search engine to rank tweets, photos, and websites by relevance, social citation scores, and time.
- The tweet is loaded into Topsy’s in-memory Grapher, so that it shows up on Topsy Author pages and Topsy Trackback pages.
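The retweet-verification step above can be sketched in a few lines. This is a toy illustration, not Topsy’s actual code: it assumes a Jaccard similarity over word tokens and a hypothetical threshold, where the real system’s scoring function and cutoff are not described here.

```python
import re

def tokenize(text):
    """Lowercase and split into word tokens (toy tokenizer)."""
    return set(re.findall(r"\w+", text.lower()))

def similarity(original, candidate):
    """Jaccard similarity between the token sets of two texts."""
    a, b = tokenize(original), tokenize(candidate)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def is_organic_retweet(original, candidate, threshold=0.8):
    """Strip a leading 'RT @user:' prefix, then compare text similarity."""
    stripped = re.sub(r"^RT @\w+:\s*", "", candidate)
    return similarity(original, stripped) >= threshold
```

A candidate whose stripped text closely matches the original is linked back to it as an organic retweet; unrelated text scores near zero and is rejected.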
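One common way to make CJK text searchable without whitespace word boundaries is to index runs of ideographs and kana as overlapping character bigrams. The sketch below assumes that approach; a production system would likely combine it with a morphological analyzer, and the Unicode ranges used here are a rough approximation.

```python
import re

# Rough ranges covering Japanese kana and CJK unified ideographs
# (an approximation; real linguistic analysis is more involved).
CJK = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]+")

def cjk_bigrams(text):
    """Tokenize text, emitting overlapping character bigrams for CJK runs
    and plain lowercased word tokens for everything else."""
    tokens = []
    pos = 0
    for match in CJK.finditer(text):
        # Tokenize the non-CJK stretch before this run.
        tokens += re.findall(r"\w+", text[pos:match.start()].lower())
        run = match.group()
        if len(run) == 1:
            tokens.append(run)
        else:
            tokens += [run[i:i + 2] for i in range(len(run) - 1)]
        pos = match.end()
    tokens += re.findall(r"\w+", text[pos:].lower())
    return tokens
```

Because a query is tokenized the same way, any bigram of a longer query phrase matches the indexed bigrams of a tweet containing it, which is what makes boundary-free scripts searchable.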
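The citation index model described above can be pictured as an inverted index whose postings keep each text associated with its author and links. The following is a minimal sketch under that assumption – the class and field names are hypothetical, and Topsy’s actual index is a distributed system, not an in-memory dictionary.

```python
from collections import defaultdict

class CitationIndex:
    """Toy inverted index: each posting points to a citation tying a term
    back to the tweet's author and the links it cited (a sketch only)."""

    def __init__(self):
        self.postings = defaultdict(list)  # term -> list of citation ids
        self.citations = []                # citation id -> record

    def add(self, author, text, links, timestamp):
        cid = len(self.citations)
        self.citations.append({"author": author, "text": text,
                               "links": links, "ts": timestamp})
        for term in set(text.lower().split()):
            self.postings[term].append(cid)
        return cid

    def search(self, term):
        """Return matching citations, newest first."""
        hits = [self.citations[cid] for cid in self.postings[term.lower()]]
        return sorted(hits, key=lambda c: c["ts"], reverse=True)
```

Because every posting carries the author and links along with the text, results can be ranked not just by term relevance and time but also by who said it and what it cites.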
This pipeline runs tens of thousands of times a second and generates a 10x fanout (each tweet results in 10 or more pieces of data). We add close to 500 million pieces of data to our index every day.
Four Demons of Search
Traditional search architectures are designed to process large volumes of data in batch processes. The large-batch approach doesn’t work for processing and indexing real-time streams.
Search engines are constantly battling what we call the Four Demons of Search: query speed, relevance, frequency of index updates and recall (indexing historical data). Traditional web search engines compromise on update frequency; real-time search engines typically compromise on recall (historical data) and relevance. At Topsy, we’ve architected our search platform to scale up in all four dimensions using modern techniques for indexing and run-time query processing.
This is an exciting new direction in information retrieval, especially as large amounts of real-time data are being created and become available on the web. In a future post, we’ll cover the design choices that have resulted in our approach and architecture.