Friday, April 4, 2008

Validation

Web pages are written in the HTML or XHTML language. Like any language, these have ground rules, or standards. The process of testing a page against these standards is called validation. The HTML standards are defined by the World Wide Web Consortium (www.w3.org). They help web authors write pages that can be used by a variety of browsers: Internet Explorer, Firefox, Safari or even Lynx. Despite these standards, the world of HTML seems somewhat anarchic. This sentiment stems from the browser wars of the mid-90s, when Netscape and Microsoft added new features to HTML with scant regard for standards. There is a feeling among some SEOers that standards are not important for search engine ranking, and there is some truth in this. Point enough inbound-links at a junk page with good anchor text and Google will probably end up returning that page in its results. This feeling is reinforced by a statement from Google:

"Any parser which is designed to run on the entire Web must handle a huge array of possible errors. These range from typos in HTML tags to kilobytes of zeros in the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep, and a great variety of other errors that challenge anyone's imagination to come up with equally creative ones."

In other words: forget standards, our search engine will sort out your HTML spaghetti regardless.

While this may have been true when Google relied heavily on its PageRank algorithm, it certainly won't prove true as search engines move towards a better understanding of the semantics of a page. It is easy for some web authors to forget that HTML is not a formatting language but a mark-up language. It describes the structure of the document in terms of Title elements, Heading levels, Lists, Quotations etc. While these are on-page elements and therefore prone to manipulation, they still give the search engine clues to the meaning of the page. Even worse, syntax errors in HTML tags may cause the search engine to confuse content and mark-up, or to give the wrong significance to parts of the page.

In order to use a validation tool to trap errors, the first thing to decide is which standard you are going to follow. The most popular, partly because it is the most flexible, is HTML 4.01 Transitional. Add a Doctype declaration at the start of each page to tell your clients, web browsers, validation tools and potentially robots (although they almost certainly don't use this information) that you are following this standard:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

Some web authoring tools do this for you. Once your page is online you can go to the W3C's validator at

http://validator.w3.org/

and enter the address of your page. A common error is missing ALT text on images, and some web authoring tools produce very poor code.
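As a rough local check for the missing ALT text error mentioned above, the following Python sketch scans a page for image tags that have no alt attribute. It is not a substitute for the W3C validator and tests nothing else about the page. The class and function names are made up for this illustration; only the standard library html.parser module is used.

```python
from html.parser import HTMLParser


class MissingAltChecker(HTMLParser):
    """Collects the position of every <img> tag that has no alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []  # (line, column) of each offending <img>

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names for us.
        if tag == "img" and "alt" not in dict(attrs):
            self.missing.append(self.getpos())


def find_images_without_alt(html_text):
    """Return a list of (line, column) positions of <img> tags lacking alt."""
    checker = MissingAltChecker()
    checker.feed(html_text)
    checker.close()
    return checker.missing


if __name__ == "__main__":
    sample = '<p><img src="a.gif" alt="Logo"><IMG SRC="b.gif"></p>'
    for line, column in find_images_without_alt(sample):
        print(f"img without alt at line {line}, column {column}")
```

An empty alt="" counts as present here, since the attribute exists even though it carries no text.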

Apart from indicating to search engine algorithms the structure of your page, validated HTML is easier for third parties such as web design companies and search engine optimization consultants to work on. It will also make the code more portable between different web authoring packages.

Although many sites rank well without adhering to standards, validation should help protect your site from changes to the way browsers and search engines work. Certainly if you are optimizing code professionally it should adhere to one of the standards.

On a related topic, some web developers and optimizers like to get an idea of how their snazzy page will render in simple browsers. If validated HTML is used along with style sheets for formatting, graphically interesting pages shouldn't present any problems. The Lynx web browser is often used for this purpose as it is a text-only web browser. If you don't want to install Lynx, the following website

http://www.delorie.com/web/lynxview.html

aims to emulate the output and is useful for giving an idea of how readable the content is.

Google

In 1998 it seemed like Silicon Valley and its dot-com phenomenon were the center of the universe. At Stanford University two graduate students, Larry Page, son of a Computer Science professor, and Sergey Brin, born in Moscow to Russian parents, had been working on search engine technology since they met in 1995.

At that time search engines usually relied on on-page factors to determine the ranking of web pages in their results. Searching was a bit hit-or-miss, particularly for neophytes who had trouble with advanced search forms and their arcane boolean logic. Page and Brin developed a new way of ranking web pages based on the number of inbound-links. Effectively each inbound-link is treated as a vote for the page. Pages with more inbound-links are deemed to be more important. As a further twist, the outbound-links (votes) of more important pages are also given more weight. The PageRank algorithm, a way of sifting the wheat from the chaff of search engine results, was born.

The results on Stanford's computers were impressive, but to launch their service Page and Brin needed venture capital to buy their own hardware and finance further development. They tried to set up a meeting with Andy Bechtolsheim, a co-founder of Sun, another successful Stanford spin-off. Bechtolsheim was interested but was a very busy man. Finally they managed to grab a couple of minutes with him in a coffee shop to explain their project, now called Google.

Bechtolsheim immediately offered a check for 100,000 dollars. "We'll work out the details later...", he told them before rushing off to another meeting. There was one big problem: the check was made out to Google Inc., a company that did not yet exist. Fortunately just the rumor that Bechtolsheim had given money was enough to get other venture capitalists running in Google's direction.

Google was launched out of a building in Menlo Park, California on the 7th of September 1998. This was the start of the Googleplex. In addition to Bechtolsheim's initial seed money the pair raised a further $1 million. The following summer two investment capital funds would put up another $25 million.

At a time when other search engines were polluting their results with paid inclusion and sponsored links and wrapping the search interface inside a "portal", news of Google spread rapidly, especially amongst computer geeks. By the late spring of 2000 Google was answering 18 million queries each day and had become the biggest search engine on the Web. The interface was incredibly simple, just a text box, but thanks to PageRank the results were amazingly good, invariably pulling up the requested information or website in the top ten links. Google often indexed the mass of information that had appeared on company websites better than their internal search interfaces did. This ease of use meant that it was becoming a firm favorite with the increasingly large number of non-expert users on the Web, so much so that the term 'googling for information' has entered everyday speech. In a major coup Yahoo!, the largest website around, decided to use Google to supply its own search results.

Google has sought to extend the reach of its technology to include news sources, weblogs and personal email. The introduction of Gmail proved controversial with privacy campaigners, who believe that indexing personal email represents an intrusion of privacy. Its Froogle (frugal - groan!) search engine provides price comparisons for users searching for products. Google is built using off-the-shelf PC hardware configured in a cluster of over 10,000 machines (some people think they have closer to 50,000 machines) running a customized version of the Linux operating system. On the same day that Microsoft announced the beta launch of its new algorithmic search engine, Google doubled the number of pages in its index to a claimed 8 billion pages. Some of these pages may not be fully indexed and return results based on URL and title keywords. This showed that Google was capable of rolling out changes to its massive array of computers and that it was not afraid to play the marketing game with Microsoft.

For a long time Google resisted the urge to go public. Larry and Sergey's leitmotif for the company is 'never do anything evil'. While some people question this, the company is viewed a lot more favorably than that other behemoth of computing, Microsoft. Google instead concentrated on building up revenue streams, from selling its search technology to the more recent introduction of the AdWords and AdSense programs. The former lets websites bid on keywords; winners have a small advert placed on the same page as relevant search results or on pages of partners in the AdSense program. Overall this form of unobtrusive advertising has been well received, although some Web users run blocker software on their web browser. Post-IPO, Google posted a doubling of profits and revenues, both of which show strong growth due to online advertising.

Freshbot Deepbot

Google uses two robots to crawl web content. These have been dubbed the Freshbot and Deepbot after their general purpose. The Deepbot performs the once-a-month deep crawl of web content that results in the main Google index. The Freshbot crawls the web on a continuous basis and is responsible for the Everflux effect. It finds content that is updated frequently, such as news sites, forums, blogs and other websites. It appears that when Google finds a new page it checks it frequently at first to see if there are regular updates. If there are, the site is added to the list of pages to be visited by the Freshbot.

The Freshbot results appear to be compiled into a separate database. This is overwritten every time the Freshbot starts a new cycle. The Freshbot database and the main index are merged to produce search results. This means that fresh content may appear in search results very quickly but then disappear, only to resurface one or two months later in the main Google index. If the page is already in the main index, the Freshbot results may appear for a few days before reverting to the older version until the site is crawled by the Deepbot.

At one time the Freshbot used internet addresses beginning with 64.* and the Deepbot addresses beginning with 216.*, but since around the middle of 2003 the Google robots all come from machines in the 64.* or 66.* address range.

Googledance
Figure 1: Google Robots

Figure 1 is taken from the logfiles of a website. It clearly illustrates a cycle of deep crawls with a lesser number of daily visits. The Googledance follows some time after the deep crawl, as the results are processed.

Page Rank

Page Rank is the concept that enabled Google to leap-frog over other search engines during the late 1990s in terms of the relevance of its search engine results pages and the speed of search. Page Rank is covered by US patent number 6,285,999, granted on the 4th September 2001 to Larry Page, one of Google's founders. Page's idea is based on established practice for scientific papers, where the importance of a paper is based on the number of citations made to it by other papers. The Google algorithm interprets a link from one page to another as a vote. Because web pages link back and forth to each other, page rank feeds back: pages that have more votes are deemed to be more important, and the votes they cast (outbound-links) are subsequently given more weight.

Many web designers' efforts to optimize their sites for search engines, if they make any efforts at all, stop at acquiring inbound-links and exchanging links with other websites in the hope that this will boost their site's PageRank. This focus on PageRank is probably due in part to the Google toolbar that can be installed on Microsoft's Internet Explorer web browser. This shows a PageRank value for the current site on a scale of 0 to 10.

Getting a good site ranking is not quite as simple as getting a good score on the Google toolbar and decisions about outbound-links and site structure can have significant effects on PageRank.

Fortunately because of the patent the theory behind PageRank is known. The formula is relatively simple:

$$PR = (1 - \partial) + \partial \sum_{i=1}^{n} \frac{PR_i}{C_i}$$

Where

∂       damping factor, given as 0.85
n       total number of inbound links
PR_i    page rank of inbound page i
C_i     number of outbound links from page i

So the PageRank (PR) for any page is equal to some constant value (0.15) plus the sum of all the page ranks from inbound links divided by the number of links (C) on each corresponding page multiplied by the damping factor (0.85).

Don't worry if that is too hard to visualize; we will look at some real examples below. One more thing before we do. As pages can, and usually do, feed back to other pages by way of hyperlinks, the calculation has to be made iteratively. Every time Google visits a page it recalculates the PageRank based on the weights of the inbound links. Assuming no other changes, it takes around 40 iterations for the figure to converge to a stable value, although Google itself uses linear algebra to reduce the number of calculations. The damping factor (∂) is set at 0.85 for optimum performance.
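To make the iteration concrete, here is a minimal Python sketch of the calculation described above: each page's rank is 0.15 plus 0.85 times the sum of PR/C over its inbound pages, repeated 40 times. The link map, the starting value of 1.0 for every page and the function name are assumptions made for the illustration; this is not Google's implementation.

```python
def pagerank(links, damping=0.85, iterations=40):
    """Iteratively apply PR = (1 - d) + d * sum(PR_i / C_i).

    links maps each page to the list of pages it links to
    (assumed here to contain no duplicate targets).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Arbitrary starting value for every page (an assumption of this sketch).
    pr = {page: 1.0 for page in pages}

    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each inbound page passes on its rank divided by its link count.
            inbound = sum(
                pr[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - damping) + damping * inbound
        pr = new_pr
    return pr


if __name__ == "__main__":
    # Hypothetical three-page site: A links to B and C, both link back to A.
    site = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
    for page, rank in sorted(pagerank(site).items()):
        print(page, round(rank, 3))
```

In this small example page A collects the votes of both B and C and so settles at a higher value than either of them, while B and C each receive only half of A's vote.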

Page Rank Theoretical

Google was the first company to patent a system that takes inbound links into account. The algorithm was named Page Rank. In this section, we will describe this algorithm and how it can influence search result ranking.

   Page Rank is estimated separately for each web page and is determined by the Page Rank (citation) of other pages referring to it. It is a kind of “virtuous circle.” The main task is to find the criterion that determines page importance. In the case of Page Rank, it is the possible frequency of visits to a page.

   I shall now describe how a user's behavior when following links to surf the network is modeled. It is assumed that the user starts viewing sites from some random page. Then he or she follows links to other web resources. There is always a possibility that the user may leave a site without following any outbound link and start viewing documents from a random page. The Page Rank algorithm estimates the probability of this event as 0.15 at each step. The probability that our user continues surfing by following one of the links available on the current page is therefore 0.85, assuming that all links are equal in this case. If he or she continues surfing indefinitely, popular pages will be visited many more times than the less popular pages.

   The Page Rank of a specified web page is thus defined as the probability that a user may visit the web page. It follows that the sum of probabilities for all existing web pages is exactly one, because the user is assumed to be visiting at least one Internet page at any given moment.

   Since it is not always convenient to work with these probabilities the Page Rank can be mathematically transformed into a more easily understood number for viewing. For instance, we are used to seeing a Page Rank number between zero and ten on the Google Toolbar.

    According to the ranking model described above:
   - Each page on the Net (even if there are no inbound links to it) initially has a Page Rank greater than zero, although it will be very small. There is a tiny chance that a user may accidentally navigate to it.
   - Each page that has outbound links distributes part of its Page Rank to the referenced pages. The Page Rank contributed to these linked-to pages is inversely proportional to the total number of links on the linked-from page; the more links it has, the lower the Page Rank allocated to each linked-to page.
   - A “damping factor” is applied to this process so that the total distributed Page Rank is reduced by 15%. This is equivalent to the probability, described above, that the user will not visit any of the linked-to pages but will navigate to an unrelated website.
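The surfing model above can be imitated with a small simulation. The Python sketch below lets an imaginary user follow a random link with probability 0.85 and jump to a random page otherwise, then reports the share of visits each page received. The link map, the step count, the fixed random seed and the function name are all assumptions for the illustration, and every linked-to page is assumed to appear as a key in the map.

```python
import random


def simulate_surfer(links, steps=100_000, follow_probability=0.85, seed=0):
    """Estimate visit frequencies for a random surfer.

    links maps each page to the list of pages it links to; every
    target is assumed to be a key of links as well.
    """
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)  # start from some random page

    for _ in range(steps):
        visits[current] += 1
        targets = links[current]
        if targets and rng.random() < follow_probability:
            # Follow one of the outbound links, all treated as equal.
            current = rng.choice(targets)
        else:
            # Give up on this site and start again from a random page.
            current = rng.choice(pages)

    # Visit share per page; the shares add up to one.
    return {page: count / steps for page, count in visits.items()}


if __name__ == "__main__":
    web = {"A": ["B", "C"], "B": ["A"], "C": ["A"], "D": ["A"]}
    for page, share in simulate_surfer(web).items():
        print(page, round(share, 3))
```

Page D has no inbound links in this example, yet it still receives a small share of visits from the random jumps, which matches the first point in the list above.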

   Let us now see how this Page Rank process might influence the process of ranking search results. We say “might” because the pure Page Rank algorithm just described has not been used in the Google algorithm for quite a while now. We will discuss a more current and sophisticated version shortly. There is nothing difficult about the Page Rank influence – after the search engine finds a number of relevant documents (using internal text criteria), they can be sorted according to the Page Rank since it would be logical to suppose that a document having a larger number of high-quality inbound links contains the most valuable information.

   Thus, the Page Rank algorithm "pushes up" those documents that are most popular outside the search engine as well.

Google Page Rank

Currently, Page Rank is not used directly in the Google algorithm. This is to be expected since pure Page Rank characterizes only the number and the quality of inbound links to a site, but it completely ignores the text of links and the information content of referring pages. These factors are important in page ranking and they are taken into account in later versions of the algorithm. It is thought that the current Google ranking algorithm ranks pages according to thematic Page Rank. In other words, it emphasizes the importance of links from pages with content related by similar topics or themes. The exact details of this algorithm are known only to Google developers.

   You can determine the Page Rank value for any web page with the help of the Google Toolbar that shows a Page Rank value within the range from 0 to 10. It should be noted that the Google Toolbar does not show the exact Page Rank probability value, but the Page Rank range a particular site is in. Each range (from 0 to 10) is defined according to a logarithmic scale.

    Here is an example: each page has a real Page Rank value known only to Google. To derive a displayed Page Rank range for their Toolbar, they use a logarithmic scale as shown in this table:

          Real PR          Toolbar PR
          1-10             1
          10-100           2
          100-1000         3
          1000-10,000      4
          Etc.

   This shows that the Page Rank ranges displayed on the Google Toolbar are not all equal. It is easy, for example, to increase Page Rank from one to two, while it is much more difficult to increase it from six to seven.
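The table can be expressed as a one-line rule. The Python sketch below maps a "real" value onto the Toolbar scale using a base-10 logarithm, which reproduces the illustrative ranges in the table. The real values and the base Google actually uses are not public, so both the base and the function name are assumptions.

```python
import math


def toolbar_pr(real_pr):
    """Map an illustrative 'real' Page Rank value to the 0-10 Toolbar scale.

    Assumes the base-10 ranges from the example table: 1-10 gives 1,
    10-100 gives 2, and so on, capped at 10.
    """
    if real_pr < 1:
        return 0
    return min(10, math.floor(math.log10(real_pr)) + 1)


if __name__ == "__main__":
    for value in (5, 50, 500, 5000):
        print(value, "->", toolbar_pr(value))
```

Under this assumption, moving from Toolbar value 1 to 2 needs a real value of only 10, while moving from 6 to 7 needs a real value of 1,000,000, which is why the higher steps are so much harder to climb.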

   In practice, Page Rank is mainly used for two purposes:

   1. Quick check of a site's popularity. Page Rank does not give exact information about referring pages, but it allows you to quickly and easily get a feel for the site's popularity level and to follow trends that may result from your SEO work. You can use the following “rule of thumb” measures for English language sites: PR 4-5 is typical for most sites with average popularity. PR 6 indicates a very popular site, while PR 7 is almost unreachable for a regular webmaster. You should congratulate yourself if you manage to achieve it. PR 8, 9 and 10 can only be achieved by the sites of large companies such as Microsoft, Google, etc. Page Rank is also useful when exchanging links and in similar situations. You can compare the quality of the pages offered in the exchange with pages from your own site to decide if the exchange should be accepted.

   2. Evaluation of the competitiveness level for a search query, which is a vital part of SEO work. Although Page Rank is not used directly in the ranking algorithms, it allows you to indirectly evaluate relative site competitiveness for a particular query. For example, if the search engine displays sites with Page Rank 6-7 in the top search results, a site with Page Rank 4 is not likely to get to the top of the results list using the same search query.

   It is important to recognize that the Page Rank values displayed on the Google Toolbar are recalculated only occasionally (every few months) so the Google Toolbar displays somewhat outdated information. This means that the Google search engine tracks changes in inbound links much faster than these changes are reflected on the Google Toolbar.

Keyword Density

Keyword density is a measure of the relative frequency of a word in the text of a page, expressed as a percentage. For example, if a specific word is used 5 times on a page containing 100 words, the keyword density is 5%. If the density of a keyword is too low, the search engine will not pay much attention to it. If the density is too high, the search engine may activate its spam filter. If this happens, the page will be penalized and its position in search listings will be deliberately lowered. The optimum value for keyword density is 5-7%. In the case of keyword phrases, you should calculate the total density of each of the individual keywords comprising the phrases to make sure it is within the specified limits. In practice, a keyword density of more than 7-8% does not seem to have any negative SEO consequences. However, it is not necessary and can reduce the legibility of the content from a user's viewpoint.
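The calculation is easy to automate. The Python sketch below counts words in a block of plain text and returns the density of a keyword as a percentage, plus the summed density of the individual words in a phrase as described above. The word-splitting rule (letters, digits and apostrophes) and the function names are assumptions for the illustration; search engines do not publish how they tokenize pages.

```python
import re


def keyword_density(text, keyword):
    """Return the density of keyword in text as a percentage."""
    # Simple tokenizer: runs of letters, digits and apostrophes (an assumption).
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    if not words:
        return 0.0
    return 100.0 * words.count(keyword.lower()) / len(words)


def phrase_density(text, phrase):
    """Return the total density of the individual words in phrase."""
    return sum(
        keyword_density(text, word) for word in set(phrase.lower().split())
    )


if __name__ == "__main__":
    page_text = "seo " * 5 + "filler " * 95  # 5 uses in a 100-word page
    print(keyword_density(page_text, "SEO"))  # 5 of 100 words, i.e. 5%
```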