PHP Search Engine Code

The right search engine on your website won’t bring you more traffic but it will help your visitors to better locate things so it will keep them on your pages. A good search engine implemented can increase your hits with almost 30% and this is tested. I’m not talking just to have a tutorial.  This code will help you find files in Windows 10 and Window 7 PCs and Laptops.

There are many ways of doing this. You can go buy a software that will scan your website and create static pages with the results. I don’t imagine myself ever using this kind of search but may be suitable for some websites. Another way would be to signup with google and implement their own search in your pages. The results will be shown by google but I wouldn’t use this one either…who knows how long it takes until google comes to crawl my website for new stuff and how long until it will display the results.

Another way is to have a good php search engine that will interact with your database and scan the desired tables for a given string. I like this one because it gives you the power to be precise, to modify things to your own needs, to display to the visitor exact data or not etc. Ok…php search…easy you may say. Not until I will show you what to avoid because, if you do it wrong, it’s useless. Completely useless!

A simple word like “AND” changed to another one like “OR” can have a massive impact on your results because that’s what we’re trying to bring here…results and not just any of them but relevant ones.

This will be a long tutorial and I hope I will be able to be very transparent and make it easy to follow. I recently worked to implement dynamic search on a website and there is a big number of operations to deal with.
I knew it will be a long and hard job so I’ve searched a little on google for pre-built scripts to do the same thing and save me the time but I didn’t find any good results. Only crap. Simple stuff that won’t bring results and I’ll show you why. I will build this tutorial by comparing the simple methods with the ones that will be described…ones yes because there are more than one good solutions to display relevant results.

Lets’s start with a simple example where the string we’re going to search is the word “form”. We have a table with an id, title, subtitle and content where we store our articles. We want to confront the database with the string. We are going to store our string in a variable:

  1. $string = “form”;

Next we will have to build the recordset in order to display our results.

  1. $sql = “SELECT * FROM articles WHERE title LIKE ‘$string’ AND subtitle LIKE ‘$string’ AND content LIKE ‘$string’ ORDER BY id DESC”;
  2. $query = mysql_query($sql) or die(mysql_error());
  3. $row_sql = mysql_fetch_assoc($query);
  4. $total = mysql_num_rows($query);
  5. if($total>0) {
  6.         while ($row_sql = mysql_fetch_assoc($query)) {//echo out the results
  7.         echo .$row_sql[‘title’].‘<br />’.$row_sql[‘subtitle’].;
  8.         }
  9. } else
  10.         {
  11.         echo “No results to display”;
  12. }

This is the simplest method and, as it is, it will probably fail to deliver any good results. Why? Well…let’s take that SQL query and analyze it. We’ve compared the $string variable with 3 rows from our table and we used the condition “AND” so, as a result, the database will be forced to find and article that has the word “form” in all of the 3 rows (title, subtitle and content). If no match will be found, we prepared an error.

Conclusion for this method:

– AND forces our database to find an article where all our selected rows MUST contain our string
– the string must match exactly (form will have to be form and not forms which is the plural for form but still relevant)

In a response to this very simple method which brings almost 0 results we are starting to look for other ways of searching our database. Let’s digg deeper a little bit and change some facts. Let’s take our query and replace “AND” with “OR”. A huge change. The database will have to find a match in any of the 3 rows (title, subtitle or content) so, we can say, that we managed to multiply our results by 3 just by replacing “AND” with “OR” and, most probably, we’re starting to see some results.

The new query will be:

  1. SELECT * FROM articles WHERE title LIKE ‘$string’ OR subtitle LIKE ‘$string’ OR content LIKE ‘$string’ ORDER BY id DESC

Conclusion for this method:

– OR had a good effect and we found more results to display
– we’re still looking for words to exactly match our string (“forms” compared to “form” won’t be displayed as a result but forms is the plural for form so it can be relevant)

We have our 2nd conclusion so we have to do something with this new method of searching to find more results that are also relevant. Not just exact matches. Let me introduce you to our new buddy “%”. What can he do? Lots of things, he’s the one with the magic tricks that will take our query to a new level. The percent sign is a wildcard character so that the LIKE string can be extended in order to match multiple records.

The new query will be:

  1. SELECT * FROM articles WHERE title LIKE ‘%’.$string.‘%’ OR subtitle LIKE ‘%’.$string.‘%’ OR content LIKE ‘%’.$string.‘%’ ORDER BY id DESC

Our database will display results that contain the word form wherever he sits in another word. “forms” will be a results, “forming” will be a result, “formatting” will also be a result.

‘.$string.’%’ = only results that start with “form”
‘%’.$string.’%’ = results that start or end or contain “form”
‘%’.$string.” = only results that end with “form”

As you’ve probably noticed I’m building this tutorial on levels so here’s the next one. What if…our visitor will search for “forms” and not “form”? “form” will not pass as a result and will not be presented by our database even if we used all the above methods. Don’t worry, we have a solution. It’s time to prepare our data before using it in our query (I don’t recommend you to do queries with direct data from user input without checking it first…EVER).

It’s time for you to find out about The Porter Stemmer algorithm.

The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.

There are 2 classes (or more!?) available for PHP. One belongs to Richard Heyes and it requires PHP5 and the other one that we will be using belongs to Jon Abernathy. What is does?!

It will basically takes a word, or list of words, and reduces them to their English stems. “forms” will probably become “form” so we’re back in business. Searching for “forms” will bring “form” as a result too. Here’s were we are so far:

  1. include(‘stemmer.class.inc’);
  2.         $stemmer = new Stemmer;
  3.         $stemmed_string = $stemmer->stem($string);
  4.         $sql = “SELECT * FROM articles WHERE title LIKE ‘%’.$stemmed_string.’%’ OR subtitle LIKE ‘%’.$stemmed_string.’%’ OR content LIKE ‘%’.$stemmed_string.’%’ ORDER BY id DESC”;

If you think this would bring back too many results I recommend you to use the stemmed string only as a prefix for what we’re searching.

  1. $sql = “SELECT * FROM articles WHERE title LIKE ”.$stemmed_string.’%’ OR subtitle LIKE ”.$stemmed_string.’%’ OR content LIKE ”.$stemmed_string.’%’ ORDER BY id DESC”;

Both methods are good. It just depends on your database size and who knows, whatever you’re thinking…might work.

We will also need to transform our string to lowercase for more accurate results.

  1. $stemmed_string = $stemmer->stem(strtolower($string));

You think we’re done?! Not at all…we’re at the half of our tutorial.

What if…our visitor will search for “tableless forms” for example. The database will return the articles that contain the string but not the articles that will contain at least one word from our string, not even the articles that contain the word “tableless” at the start of the content and the word “forms” at the end or middle of it. What should we do?! Force the visitor to enter only one word per search in order to give him a relevant result. Not at all. We will split our string in an array and make a loop within our query in order to find results that may contain “forms” in the title, subtitle or content but not the word “tableless”. Here’s the new face of our code.

  1. $split = split(” “,$stemmed_string);
  2.         foreach ($split as $array => $value) {
  3.         $new_string .= $value;
  4.         }

The above array is used to split the string in words and stem each and every one of them (I’ve tried to stem the string and it managed to do the trick with only the first word from a given string so we’re building an array to help us out and stem each word separately).

This code might look good but it’s not. From “tableless forms” we will have “tableform” which represents the two words stemmed but unified. We will modify our array in order to add an extra space between words but to also eat up the last space: “table form” instead of “table form “. Here’s the new array:

  1. $split = split(” “,$stemmed_string);
  2.         foreach ($split as $array => $value) {
  3.         $new_string .= .$value.‘ ‘;
  4.         }
  5.         $new_string=substr($new_string,0,(strLen($new_string)-1));

Now we will have to extend our query to find results that match any of our new stemmed words. We will build a query that will look in the title or subtitle or content of our articles for table or form. More results to come yes…

  1. $split_stemmed = split(” “,$new_string);
  2.         $sql = “SELECT DISTINCT * FROM articles WHERE (“;
  3. while(list($key,$val)=each($split_stemmed)){
  4.         if($val<>” “ and strlen($val) > 0){
  5.         $sql .= “(title LIKE ‘%’.$val.’%’ OR subtitle LIKE ‘%’.$val.’%’ OR content LIKE ‘%’.$val.’%’) OR”;
  6.         }
  7. }
  8.         $sql=substr($sql,0,(strLen($sql)-3));//this will eat the last OR
  9.         $sql .= “) ORDER BY id DESC”;

This new query will have a lot more work to do but will also find more results. The generated query will look similar to this:

  1. SELECT DISTINCT * FROM articles WHERE ((title LIKE ‘%table%’ OR subtitle LIKE ‘%table%’ OR content LIKE ‘%table%’) OR (title LIKE ‘%form%’ OR subtitle LIKE ‘%form%’ OR content LIKE ‘%form%’)) ORDER BY id DESC

I’m sure you can see the difference. The script will now produce a new set of queries for each stemmed word in part.

We may also user RLIKE instead of LIKE but it’s not recommended for mid-sized databases and up because it needs much more resources from the server to perform the queries.

There is something more to add to this method I’ve tutorialized above. Let’s suppose that our visitor is not the keyword kind of guy and enters the following string: “I want to find tutorials about tableless forms”. Woooohooooo….I’m glad I’m not your server 🙂 Can you imagine the SQL??!?! Let’s work on that.

We will first eliminate all the words that are based on 3 or less letters so let’s extend the array where we create the stemmed words.

  1. $split = split(” “,$stemmed_string);
  2.         foreach ($split as $array => $value) {
  3.                 if (strlen($value) > 3) {
  4.                 $new_string .= .$value.‘ ‘;
  5.                 }
  6.         }
  7. $new_string=substr($new_string,0,(strLen($new_string)-1));

After this there is still work to do. We need to eliminate punctuations and common words like “find”, “about”…etc. We can build a class to do the job for us.

  1. class Cleaner {
  2.         var $stopwords = array(” find “, ” about “, ” me “, ” ever “, ” each “)//you need to extend this big time.
  3.         var $symbols = array(‘/’,\\,\’,‘”‘,‘,’,‘.’,‘<‘,‘>’,‘?’,‘;’,‘:’,‘[‘,‘]’,‘{‘,‘}’,‘|’,‘=’,‘+’,‘-‘,‘_’,‘)’,‘(‘,‘*’,‘&’,‘^’,‘%’,‘$’,‘#’,‘@’,‘!’,‘~’,‘`’  );//this will remove punctuation
  4.         function parseString($string) {
  5.                 $string = ‘ ‘.$string.‘ ‘;
  6.                 $string = $this->removeStopwords($string);
  7.                 $string = $this->removeSymbols($string);
  8.                 return $string;
  9.         }
  10.         function removeStopwords($string) {
  11.                 for ($i = 0; $i < sizeof($this->stopwords); $i++) {
  12.                         $string = str_replace($this->stopwords[$i],‘ ‘,$string);
  13.                 }
  14.                 //$string = str_replace(‘  ‘,’ ‘,$string);
  15.                 return trim($string);
  16.         }
  17.         function removeSymbols($string) {
  18.                 for ($i = 0; $i < sizeof($this->symbols); $i++) {
  19.                         $string = str_replace($this->symbols[$i],‘ ‘,$string);
  20.                 }
  21.                 return trim($string);
  22.         }
  23. }

Save it as cleaner.php for example and get ready to take your scripts to the…of course..NEXT LEVEL. Clean, stemmed, without punctuation, stop words whatsoever. Here’s the result that’s almost our final one:

  1. include(‘stemmer.class.inc’);
  2. include(‘cleaner.php’);
  3.         $stemmer = new Stemmer;
  4.         $stemmed_string = $stemmer->stem($string);
  5.         $clean_string = new cleaner();
  6.         $stemmed_string = $clean_string->parseString($stemmed_string);
  7.         $split = split(” “,$stemmed_string);
  8.               foreach ($split as $array => $value) {
  9.               if (strlen($value) > 3) {
  10.               continue;
  11.               }
  12.               $new_string .= .$value.‘ ‘;
  13.               }
  14.         $new_string=substr($new_string,0,(strLen($new_string)-1));
  15.         $split_stemmed = split(” “,$new_string);
  16.         $sql = “SELECT DISTINCT * FROM articles WHERE (“;
  17.         while(list($key,$val)=each($split_stemmed)){
  18.               if($val<>” “ and strlen($val) > 0){
  19.               $sql .= “(title LIKE ‘%’.$val.’%’ OR subtitle LIKE ‘%’.$val.’%’ OR content LIKE ‘%’.$val.’%’) OR”;
  20.               }
  21.         }
  22.               $sql=substr($sql,0,(strLen($sql)-3));//this will eat the last OR
  23.               $sql .= “) ORDER BY id DESC”;
  24.         $query = mysql_query($sql) or die(mysql_error());
  25.         $row_sql = mysql_fetch_assoc($query);
  26.         $total = mysql_num_rows($query);
  27.         if($total>0) {
  28.                 while ($row_sql = mysql_fetch_assoc($query)) {//echo out the results
  29.                 echo .$row_sql[‘title’].‘<br />’.$row_sql[‘subtitle’].;
  30.                 }
  31.         } else
  32.                 {
  33.                 echo “No results to display”;
  34.         }

We now have a clean search, stemmed keywords for best results but what about the relevant results that must sit on top of the other ones. How do we give weight to our results based on occurrences (occurrences is self explanatory I believe). If there are 100 results about “form” but only one or maybe 2 of them are really relevant to what we’re searching what do we do to bring it closer to the first results that our database retrieves and not just somewhere between pages based on what…id descending maybe or who knows how you order your results. What we need to do is store the number of occurrences directly from the SQL query. Yeah I know, you thought we’re going to build a new array (it could also be a solution) but we actually have more than one solutions to work with directly from the SQL query. How do we store that number? well…look at the new query:

  1. $sql = “SELECT DISTINCT COUNT(*) As occurrences, id, title, subtitle, content FROM articles WHERE (“;
  2.         while(list($key,$val)=each($split_stemmed)){
  3.               if($val<>” “ and strlen($val) > 0){
  4.               $sql .= “(title LIKE ‘%’.$val.’%’ OR subtitle LIKE ‘%’.$val.’%’ OR content LIKE ‘%’.$val.’%’) OR”;
  5.               }
  6.         }
  7.               $sql=substr($sql,0,(strLen($sql)-3));//this will eat the last OR
  8.               $sql .= “) GROUP BY id ORDER BY occurrences DESC”;

The other way of doing this is easyer but leaves you with less choices on how to display results and what to display.

It’s called the “Boolean Full-Text Searches” which will figure out by itself which results are relevant, what is a common word, stop word etc…Will basically do the same thing as our script above from a single query but, as I said, it will do it by itself. I recommend this to mid-size to bigger databases because it brings weird results when used against small ones.

For example if you have only 3 articles and all 3 of them contain the word “form” and yet our user is searching for the word “form”, this method will find the results as non-relevant due to the fact that it’s present in all articles so it’s a 100% and taken as common word, stop word. Everything that goes beyond 50% will be ignored due to the same policy.

It’s better because it needs less resources, less code and it’s also capable of ordering the results by relevancy from most to less directly from the SQL statement so it’s a hell of a smaller code. For further reading on this method I recommend you to read the Reference Manual from MySql.

Using this method we have the code and the and of this tutorial and thank you for reading this.

  1. $sql = “SELECT DISTINCT MATCH(title, subtitle, content) Against (‘$string’ IN BOOLEAN MODE) as score, title, subtitle, content FROM articles WHERE MATCH(title, subtitle, content) Against (‘$string’ IN BOOLEAN MODE) ORDER BY score DESC”;
  2. $query = mysql_query($sql) or die(mysql_error());
  3. $row_sql = mysql_fetch_assoc($query);
  4. $total = mysql_num_rows($query);
  5. if($total>0) {
  6.         while ($row_sql = mysql_fetch_assoc($query)) {//echo out the results
  7.         echo .$row_sql[‘title’].‘<br />’.$row_sql[‘subtitle’].;
  8.         }
  9. } else
  10.         {
  11.         echo “No results to display”;
  12. }

We could also use MATCH() Against instead of SELECT COUNT() for a faster result like this:

  1. $sql = “SELECT DISTINCT MATCH(title, subtitle, content) Against (‘$string’ IN BOOLEAN MODE) as occurrences, id, title, subtitle, content FROM articles WHERE (“;
  2.         while(list($key,$val)=each($split_stemmed)){
  3.               if($val<>” “ and strlen($val) > 0){
  4.               $sql .= “(title LIKE ‘%’.$val.’%’ OR subtitle LIKE ‘%’.$val.’%’ OR content LIKE ‘%’.$val.’%’) OR”;
  5.               }
  6.         }
  7.               $sql=substr($sql,0,(strLen($sql)-3));//this will eat the last OR
  8.               $sql .= “) ORDER BY occurrences DESC”;

What’s the difference. I already told you. MATCH() is faster but it’s kinda useless if your database is small to mid-size while COUNT() takes longer to decide what’s the score of relevancy between records but might suit your small to mid-size database better. It’s really up to you to decide what’s best for your pages. I built this tutorial offering you various choices. The ball is on your field now.

Leave a Comment!