Did you find this article on Google… or Ask Jeeves? Either way, this is how it usually works with the major search engines: you type a search phrase, click “search”, and get back a list of links, each with a brief abstract.
But what if you want to build a search system for your web company or for corporate use that gives you more than just a list of documents containing the search phrase? It turns out that you can build a system that retrieves not only references to documents but the information itself. And the more you know about the data, or about the purpose the data is going to be used for, the deeper its capabilities can be. Assume for the sake of simplicity that you search for… cars. Specifically, Dodge Ram. Google immediately gives you its top links.
But what would you say about a different approach? Consider search results like this:
In this case you have much more useful information for “Dodge Ram”: a choice of resellers and catalogs to look up prices, “similar cars”, “news”, different “categories”, Dodge Ram related locations in your neighborhood, and so on. How do you develop a vertical search engine like that?
Based on our experience building search systems at Luxoft, we know that a lot depends on what kind of “niche” the engine will serve, as well as on the initial “capacity” the engine should provide. In most cases the solution depends on the following factors:
Search approach. Are you going to build your search engine on top of another search system, or build your own from scratch? The choice is important. If you build your own search index from scratch, you will have full flexibility and will be able to get the most out of your data. On the other hand, it will take considerable time and effort to build the index, especially if the data volume is big. If the product idea allows it, you can build on top of another index instead, for instance Amazon A9 (available through the Amazon WS platform), Gigablast, or others. This is a tough decision and not always an obvious one at the beginning. A good suggestion is to delay it and prototype the system first; the prototype will show you a lot about what you actually need.
Data volume. Are you going to search thousands of user comments or billions of web pages? The less raw data you deal with, the better from an operational perspective. On the other hand, a broad data corpus gives you the best precision in statistical algorithms that rely on how often data relations occur. LinkedIn, for example, searches for people in its own database only. ZoomInfo, on the other hand, searches for people on the web. Both are examples of quality people search systems: you can be fully satisfied searching within LinkedIn because its data is well structured, while ZoomInfo gives you less structured data but from more sources, which adds another dimension to the data. Ideally, data volume determines result quality, which could be the deciding point in this case. Building a search system is sometimes possible with a simple database (e.g. MySQL or Oracle), but sometimes these systems fail to deliver adequate performance and the only remaining choice is to build your index literally from scratch. In some cases you can use very basic data storage systems such as Berkeley DB.
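To give a feel for what “building your index from scratch” means at its very simplest, here is a minimal sketch of an in-memory inverted index. It is only an illustration, not a description of any production system: the function names (tokenize, build_index, search) and the three sample documents are made up for this example, and a real index would also need persistence, ranking and far more careful text processing.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing ALL query terms (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample corpus, for illustration only.
docs = {
    1: "Dodge Ram 1500 pickup for sale",
    2: "Review of the new Dodge Charger",
    3: "Ram trucks dealer locations",
}
index = build_index(docs)
print(search(index, "dodge ram"))
```

The whole structure here lives in memory; once the term-to-documents mapping no longer fits there, it is typically moved into a key-value store, which is where a basic storage system of the kind mentioned above comes in.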
Data acquisition. Where are you going to get your data initially, and how will you keep updating it over time? This is not an issue if you are built on top of an existing search system, but if that is not technically possible you have to decide this too. It matters a lot what your data source or sources are. If your data is only web pages, the possible choices are either crawling it yourself or acquiring it from other “crawlers”. If you are going to incorporate “deep web” search results into your search, make sure you can acquire the data from the corresponding third parties. When to crawl versus buy? If you are interested in a relatively small amount of data (fewer than 1-10M URLs), you can crawl it without turning that into a separate long-running project. If your data volume is big, crawling is an expensive project and it could be better to buy a data corpus rather than collect it on your own.
Algorithms. What type of information will you search for, and what sort of results will you deliver to the end user? For instance, you might want to recognize a car model on a web page, which is hardly possible without pre-populated car model dictionaries used in conjunction with the context of the web page or of a specific paragraph. All these algorithms consume computational resources at indexing time and/or search time, increase the search index volume, complicate the index structure, and so on.
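The dictionary-plus-context idea can be sketched in a few lines. The example below is an assumption-laden toy, not the method of any particular engine: the dictionary MODEL_TO_MAKE, the word list CONTEXT_WORDS and the function find_car_models are all hypothetical. A model name is accepted if it directly follows its make (“Dodge Ram”), or if the paragraph contains at least one automotive context word; otherwise an ambiguous word like “ram” or “charger” is ignored.

```python
import re

# Hypothetical pre-populated dictionary: model name -> make.
MODEL_TO_MAKE = {"ram": "dodge", "charger": "dodge", "mustang": "ford"}

# Hypothetical words that suggest the paragraph is about cars.
CONTEXT_WORDS = {"truck", "pickup", "engine", "dealer", "mpg", "horsepower"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def find_car_models(paragraph, min_context=1):
    """Return 'make model' labels recognized in the paragraph, in order."""
    tokens = tokenize(paragraph)
    context_hits = sum(1 for t in tokens if t in CONTEXT_WORDS)
    found = []
    for i, tok in enumerate(tokens):
        make = MODEL_TO_MAKE.get(tok)
        if make is None:
            continue  # not a known model name
        preceded_by_make = i > 0 and tokens[i - 1] == make
        # Accept an explicit "make model" pair, or a bare model name
        # when the surrounding paragraph looks automotive.
        if preceded_by_make or context_hits >= min_context:
            label = f"{make} {tok}"
            if label not in found:
                found.append(label)
    return found


print(find_car_models("The new Dodge Ram has a bigger engine"))
print(find_car_models("This laptop ships with 16 GB of RAM"))
```

Even this toy shows where the cost comes from: every paragraph is scanned against the dictionary at indexing time, and each recognized model has to be stored as an extra field in the index.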
These are the basic, important things a company should consider when undertaking the creation of a vertical search engine. Building a search system is a complex task; we know it involves expertise in computational linguistics, text processing algorithms (as well as clustering, classification, etc.), parallel algorithms and much more. But the value a quality search system delivers to the user is huge: the ability to separate relevant information from the chaff.
Paul is a software architect for Luminis Technologies and the author of “Building Modular Cloud Apps With OSGi”. He believes that modularity and the cloud are the two main challenges we have to deal with to bring technology to the next level, and is working on making this possible for mainstream software development. Today he is working on educational software focussed on personalised learning for high school students in the Netherlands. Paul is an active contributor on open source projects such as Amdatu, Apache ACE and Bndtools.