How Search Engines Work
About a week ago, Kosmix, one of the few emerging businesses doing anything significant in Search, announced the release of a search technology that gives them a significant advantage in the search space: the Kosmos File System.
With all our talk of algorithms and indexes, I’m sure you realize that an engine isn’t as simple as a search box and a keyword query, but are you familiar with what is really involved and how it affects us? The most fascinating experience for me, of late, has been working with the technology behind search and learning that its implications and opportunities are worth your attention, as it is this foundation that drives our industry.
As best I can put it, there are 4 critical components to a search engine, and it is the combination of these components that makes the difference between Google and a simple site search function:
- Content and Data
- Algorithm
- Platform
- Distributed Systems (servers)
A site search box, which is what most websites not using Google Custom Search use, really only needs content and an algorithm (albeit a simple one: if keyword, then result) to return results. Search engines must have all 4 components; good search engines excel at them.
Kosmix’s Kosmos File System (KFS) is a significant enhancement to that fourth consideration, but before I go into how and why, let’s look at each component.
Content and Data
The better part of Search Engine Optimization is about supporting this piece of the puzzle (not the “algorithms”). Far more important than keyword optimization, density, anchor text, or even page titles is exposure of your content to search engines so they can index your site and serve your pages for the relevant queries. No matter how good your titles and page copy, you don’t exist to a search engine if it can’t get to you. Search engines that aggregate content well (Google) excel; those that don’t (ahem… Yahoo) frustrate us to no end.
Think about this not from the context of your role as marketer but as user: Content is a critical pillar of a Search Engine’s popularity – users will use an engine that returns relevant results and abandon one that fails to return what they seek.
This is not to suggest the pillar is as simple as aggregating content. Good search engines also leverage the data that user behavior feeds into their platform: popularity of results, search query stream, click stream. Mining that data makes all the difference in the world.
Now let’s talk about the algorithm. True search relevance is not determined merely by keywords but by clicks, search activity, popularity, the quality of content, and the context of the visit (personalization). Put simply, Google’s algorithm takes these into account with PageRank (PR), click stream analysis, inlink anchor text, and page keywords. The best example of the importance of the science is the comparison between site search and a search engine. A site search box returns results to the user by looking at the keywords in the titles and descriptions of your pages and ranking them by keyword density: no context, no weight based on demand, no measure of the timeliness of your content.
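To make the site search comparison concrete, here is a minimal Python sketch of a keyword-density ranker. It is only an illustration: the function names, the "title" and "description" keys, and the sample pages are all hypothetical, not any real product's implementation.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_density(keyword, text):
    """Share of tokens in `text` that equal `keyword` (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return tokens.count(keyword.lower()) / len(tokens)

def site_search(keyword, pages):
    """Return titles of pages mentioning `keyword`, highest density first.

    `pages` is a list of dicts with hypothetical "title" and
    "description" keys; nothing else about the page is considered.
    """
    scored = []
    for page in pages:
        text = page["title"] + " " + page["description"]
        density = keyword_density(keyword, text)
        if density > 0:
            scored.append((density, page["title"]))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [title for _, title in scored]

# Made-up sample pages for illustration only.
PAGES = [
    {"title": "Search basics", "description": "How search engines crawl and rank pages"},
    {"title": "Search search search", "description": "search search"},
    {"title": "Company picnic", "description": "Photos from the summer outing"},
]

RESULTS = site_search("search", PAGES)
```

With these sample pages, the keyword-stuffed page outranks the useful one, which is exactly the weakness described above: density alone carries no context, demand, or timeliness.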
The algorithm only directs the engine; a platform is required to model data, crunch results, and apply that algorithm to the content the engine has crawled. Google does this using Map/Reduce (I hate linking to Wikipedia!! It increases their credibility, but that is one of those terms better explained by an encyclopedia) and Bigtable, a massive-scale storage technology that allows them to better mine time-indexed data and directly support data-intensive applications like personalized search, analytics, and ad targeting. Google has a significant advantage with Bigtable in that it lets them scale by simply adding more commodity servers with very little intervention, as it handles load balancing automatically.
Zvents is working in this space with HyperTable, an exceptional technology to be released shortly. Until then, with both of Google’s technologies being proprietary to their engines, Google’s dominance can only be undone with work from the industry here (no, it is not the better algorithm that makes the engine); this is where search engines are really competing. Well, this and…
I oversimplified above when I called distributed systems the “servers” on which search engines depend, though that is a part of it. As you can imagine, search engines need thousands of servers to house the terabytes of data and manage the loads driven by millions of users submitting unique queries. I’ve oversimplified because the real challenge for engines is in mining data across that distributed system (data spread over multiple servers). Google’s distributed file system, GFS, is what gives them a platform advantage, today, over all other engines. Kosmix’s KFS, an open source release, is the next big step in distributed systems following HDFS (Hadoop) and Gluster (GNU Cluster Distribution), which are other iterations in the space. What makes KFS so monumental is that it allows businesses to easily scale beyond a single machine to build applications and run data mining clusters at parity with Google.
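For readers who would rather see the Map/Reduce idea than read an encyclopedia entry, here is a rough single-process Python sketch that builds a tiny inverted index (word to documents). It only illustrates the map, shuffle, and reduce steps of the pattern; it is not Google's implementation, and every name and sample document in it is made up for the example.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit one (word, doc_id) pair per distinct word in a document."""
    return [(word, doc_id) for word in sorted(set(text.lower().split()))]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, doc_ids):
    """Collapse all document ids for a word into one sorted posting list."""
    return word, sorted(doc_ids)

def build_index(docs):
    """Run map, shuffle, and reduce in sequence over a dict of documents."""
    pairs = []
    for doc_id, text in docs.items():
        pairs.extend(map_phase(doc_id, text))
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, ids) for word, ids in grouped.items())

# Hypothetical two-document corpus.
DOCS = {"a": "search engines crawl content", "b": "engines index content"}
INDEX = build_index(DOCS)
```

In a real cluster the map and reduce calls would run on many machines at once, each over its own slice of the crawl; here they run one after another so the data flow is easy to follow.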
So, why does this matter to you?
It is not all about the algorithm, and SEO is about more than keywords. Pay attention to content and keep your eye on the real players in the Search Engine space who are keeping Google on its toes and delivering the future of Search to users and businesses.