Saturday 30 April 2016

Android : A lesson on Regular Expressions by examples


Everyone who has ever worked with Regular Expressions knows how tricky they can be (if you don't understand them completely!). And as frustrated and angry as we might get, we can't ignore how powerful regular expressions are. Now let me put this straight first: I don't plan to pretend here that I am an expert on Regular Expressions. In this blog I am going to share with you what I have learned after spending countless hours in frustration. I am going to explain Regular Expressions with practical examples that you may face at work. If you want to brush up on RegEx characters/symbols, you can check my blog on Regular Expressions (although that blog was for PowerShell, the symbols have the same meaning). There are also some excellent articles on Regular Expressions that you can go through first, like this.

Now that you have the basics, let's see some practical examples :

Let's say you have to find a name in a given string. We know names start with an uppercase character. So our question boils down to this : write a regular expression to find words that start with an uppercase character followed by lowercase characters.

Solution: Now there are multiple ways to achieve this. I will show you 2 ways.
1-  [A-Z][a-z]+
2-  \p{Lu}\p{Ll}+

If you have carefully gone through the 2nd link I shared above, which happens to be Google's documentation on Pattern, you can at least recognize the strange symbols in the second solution.
The first solution finds every string made of one uppercase character followed by one or more lowercase characters. It's as simple as that. "+" means the preceding character or group may occur one or more times.
The second solution is a more advanced and cleaner way of doing the same thing. \p lets you select characters based on the class name you provide inside the curly braces following it. In the above example, Lu and Ll mean uppercase letter and lowercase letter respectively. Please note that in Android, as in any Java code, the regex is written as a string literal, so every backslash has to be escaped with another backslash. So the regex becomes \\p{Lu}\\p{Ll}+ when compiling it with the Pattern class. Our code would look like this :


import java.util.regex.Matcher;
import java.util.regex.Pattern;

String test = "this is some random text to test REGEX Asutosh Nayak";
// Unicode-aware alternative: "\\p{Lu}\\p{Ll}+"
Pattern pattern_u = Pattern.compile("[A-Z][a-z]+");
Matcher matcher_t = pattern_u.matcher(test);
String res = "";
while (matcher_t.find())
{
    // group() returns the text matched by the whole pattern
    res += "Match:" + matcher_t.group() + "\n";
}
textview_test.setText(res);
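
To see where the two solutions differ, here is a small sketch in plain Java (no Android views needed). The class name NamePatterns, the helper findAll and the sample text are made up for this illustration. It runs both patterns over a string that contains an accented uppercase letter, which the ASCII range [A-Z] does not cover.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamePatterns {
    // Hypothetical helper: collects every substring of text matched by regex.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // The escape \u00C9 is the uppercase letter E with an acute accent.
        String text = "random text \u00C9mile Zola";
        // ASCII ranges only: the accented name is skipped.
        System.out.println(findAll("[A-Z][a-z]+", text));
        // Unicode categories: both names are found.
        System.out.println(findAll("\\p{Lu}\\p{Ll}+", text));
    }
}
```

So if the names you expect are not limited to plain English letters, the \p version is probably the safer choice.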

Also keep in mind that Matcher.group() and Matcher.group(0) are equivalent: both return the text matched by the entire pattern. If you have capturing groups in your regex and you want the text matched by only a certain group, use Matcher.group(index), where "index" starts from 1 and groups are numbered by their opening parentheses from left to right.
To understand how group() works let's see this example :



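String test = "this is some random text to test REGEX Asutosh Nayak";
// Two capturing groups: an all-uppercase word, then a capitalized word.
// (This pattern is assumed from the explanation that follows.)
Pattern pattern_u = Pattern.compile("([A-Z]+)\\s([A-Z][a-z]+)");
Matcher matcher_t = pattern_u.matcher(test);
String res = "";
int c = 0;
while (matcher_t.find())
{
    res += "Match No." + c++ + "\n";
    // Group 0 is the whole match; groups 1..groupCount() are the parenthesized parts
    for (int i = 0; i <= matcher_t.groupCount(); i++)
    {
        res += "  Group No." + i + "\n";
        res += "    Match:" + matcher_t.group(i) + "\n";
    }
    res += "\n";
}
textview_test.setText(res);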

 

It gives result like this :





As you can see, “Group No.1” returned “REGEX”, which was found by the part of the regex within the first pair of parentheses, and “Group No.2” returned “Asutosh”, which was our second group. For those of you who are wondering what happened to “Nayak”, note that our regex was for a word with all uppercase characters followed by a word with only its first character in uppercase.
Neat right? Not so difficult. But this was one of the simplest regular expressions. Now let’s write some RegEx on numbers. Nothing is complete without numbers.

What if you had a string and you wanted to find numbers in it but not just any number. An amount in "Rupees" or "INR". So our regex should be capable of finding numbers preceded by Rs or INR.
Solution : (?i)(?:\s(?:RS|INR)\.?\s?)(\d+(\.\d{1,2})?)
Here is the sample code :


String test = "this is some random text to test REGEX for amount Rs. 911.10. Let's see 909.98.";
// Same regex as the solution above, with each backslash doubled for the Java string literal
Pattern pattern_m =
        Pattern.compile("(?i)(?:\\s(?:RS|INR)\\.?\\s?)(\\d+(\\.\\d{1,2})?)");
Matcher matcher_t = pattern_m.matcher(test);
String res = "";
int c = 0;
while (matcher_t.find())
{
    res += "Match No." + c++ + "\n";
    // Group 0 is the whole match; non-capturing groups (?:...) get no number
    for (int i = 0; i <= matcher_t.groupCount(); i++)
    {
        res += "  Group No." + i + "\n";
        res += "    Match:" + matcher_t.group(i) + "\n";
    }
    res += "\n";
}
tv_test.setText(res);

Here is how result looks :




It’s perfectly fine to panic! :-D. I will explain everything.
  • (?i) - This tells that the regex that follows is case insensitive. So this regex will treat “RS” and “Rs” the same way. If later you want to add a group to your regex which is case sensitive, just add a (?-i) before it.
  • (?:) - This is called a non-capturing group. It means the pattern inside the parentheses is used to determine the overall match, but it is not given a group number of its own. That is why “Rs.” does not show up as a separate group in the result.
  • \.? - ‘?’ is called the optional quantifier. It means the character or group preceding it can occur at most once (0 or 1 times). So here it tells that “Rs” may be followed by a “.”.
  • (\d+(\.\d{1,2})?) - This pattern is used to recognize a decimal number. \d+ means one or more digits. It is followed by a group that matches a period and 1 or 2 (at most) digits; that group is optional, since numbers may not have a decimal portion.
That’s all there is to it. These were the confusing symbols in this regex.
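
To try the symbols above outside an Android screen, here is a minimal sketch in plain Java. The class name RupeeAmounts, the method extract and the sample sentences are assumptions made for this illustration. It pulls out only group 1 (the amount) and also shows (?-i) switching case sensitivity back on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RupeeAmounts {
    // Same regex as the solution above, escaped for a Java string literal.
    private static final Pattern AMOUNT =
            Pattern.compile("(?i)(?:\\s(?:RS|INR)\\.?\\s?)(\\d+(\\.\\d{1,2})?)");

    // Returns the text of group 1 (the number) for every amount found.
    static List<String> extract(String text) {
        List<String> amounts = new ArrayList<>();
        Matcher m = AMOUNT.matcher(text);
        while (m.find()) {
            amounts.add(m.group(1));
        }
        return amounts;
    }

    public static void main(String[] args) {
        // Only the numbers that follow Rs or INR (in any case) are returned.
        System.out.println(extract("Paid Rs. 911.10, then inr 50, then 909.98"));

        // (?i) makes "inr" case insensitive; (?-i) makes "paid" case sensitive again.
        System.out.println(Pattern.matches("(?i)inr (?-i)paid", "INR paid"));
        System.out.println(Pattern.matches("(?i)inr (?-i)paid", "inr PAID"));
    }
}
```

Asking for group(1) instead of group() is what drops the “Rs.” prefix from the result.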
Let’s make it even tougher. What if your string now has two numbers and you want to get the ordinary number, not the money value? So we have to build a regular expression to find a number which is not preceded by Rs or INR. Sample String : "this is some random text to test REGEX for amount Rs. 911.10. Let's see 909.98 Test."

Solution: (?i)[^(Rs|INR\.?\s?)](\s\d+(\.\d{1,2})?\s?)

Tip: while writing complicated regular expressions always try to start small. Like if you have to find numbers not preceded by Rs or INR try first finding Rs or INR, then find numbers with Rs or INR and then finally negate the Rs or INR group. This will help you find which portion of regex is not working.
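
Here is one way the "start small" idea could look in plain Java. The class StepByStep and its findAll helper are hypothetical names for this sketch. Each step is run on its own so you can see which piece stops matching before you combine them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StepByStep {
    // Hypothetical helper: collects every substring of text matched by regex.
    static List<String> findAll(String regex, String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String text = "amount Rs. 911.10. Let's see 909.98 Test.";

        // Step 1: only the currency marker.
        System.out.println(findAll("(?i)(?:RS|INR)\\.?", text));

        // Step 2: the marker followed by a number.
        System.out.println(findAll("(?i)(?:RS|INR)\\.?\\s?\\d+(?:\\.\\d{1,2})?", text));

        // Step 3: every number, to see what the final pattern still has to filter out.
        System.out.println(findAll("\\d+(?:\\.\\d{1,2})?", text));
    }
}
```

Once each step returns what you expect, you only have the negation left to debug.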

With code similar to the above, and the string and regular expression changed accordingly, you get the following result :
   



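All we did was surround the regex for finding "Rs or INR" with square brackets and add a "^" to it. Inside square brackets "^" is like a logical NOT, but be aware that square brackets define a character class, so "[^...]" does not negate the group as a whole: it matches one single character that is none of the characters listed inside. It works on our sample string because the character just before " 911.10" is ".", which is listed, while the character just before " 909.98" is "e", which is not. The same pattern would also skip a number that follows any word ending in one of the listed letters, such as "costs 500". For a strict "not preceded by Rs or INR" check, a negative lookbehind, written (?<!...), is the more reliable tool.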
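
If you want to compare the square-bracket pattern with a negative lookbehind, here is a rough sketch in plain Java. The lookbehind pattern is my own assumption and not the solution used above, and the class PlainNumbers, the method numbers and the second sample sentence are made up for this illustration. The extra (?<![\d.]) is there so a match cannot start in the middle of a number such as 911.10.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlainNumbers {
    // The square-bracket pattern from the solution above; the number is in group 1.
    static final Pattern CHAR_CLASS =
            Pattern.compile("(?i)[^(Rs|INR\\.?\\s?)](\\s\\d+(\\.\\d{1,2})?\\s?)");

    // Assumed alternative: reject a number that is preceded by Rs/INR,
    // or that starts in the middle of another number.
    static final Pattern LOOKBEHIND =
            Pattern.compile("(?i)(?<!(?:RS|INR)\\.?\\s?)(?<![\\d.])\\d+(?:\\.\\d{1,2})?");

    // Collects the given group of every match, with surrounding spaces removed.
    static List<String> numbers(Pattern pattern, int group, String text) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            found.add(m.group(group).trim());
        }
        return found;
    }

    public static void main(String[] args) {
        String sample = "amount Rs. 911.10. Let's see 909.98 Test.";
        // Both patterns pick the plain number on the sample string.
        System.out.println(numbers(CHAR_CLASS, 1, sample));
        System.out.println(numbers(LOOKBEHIND, 0, sample));

        String tricky = "It costs 500 and Rs 20";
        // The character class rejects 500 because "costs" ends in 's'.
        System.out.println(numbers(CHAR_CLASS, 1, tricky));
        // The lookbehind only rejects the number after "Rs".
        System.out.println(numbers(LOOKBEHIND, 0, tricky));
    }
}
```

Java accepts this lookbehind because its length has an obvious upper bound; a lookbehind containing + or * would be rejected.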

Regular Expressions can be tricky to debug and really frustrating sometimes. But it’s a really powerful tool at our hands to quickly search for a pattern.
