Tuesday, August 26, 2008

Using Reg Ex to identify Strings representing int/float


Write a Java program using RegEx to identify a String representing int/float (Whitespaces allowed)


Some of you may wonder why we should use Reg Ex to identify Strings representing numbers when Java has other options, including the well-known Scanner class introduced in Java 5.0. The point here is not to discuss whether we really need Reg Ex for this task; the idea is to make you familiar with the usage of Regular Expressions in Java.


If you are comfortable with Regular Expressions then you can go straight to the Java code doing the required task - Java Program to test Strings using RegEx >>. But if you want to see how the regular expressions were built, please proceed with this article. It gives a step-by-step explanation of how to build Regular Expressions to validate Strings representing int or float values, with or without Whitespaces. In this article I'm considering only Blank Spaces as Whitespaces. For other whitespace characters you may simply need to tweak the code a bit.


Regular Expression for testing Strings containing int - we will first try to build Regular Expressions for Basic int processing which will be able to validate only those Strings which contain a valid int value without any Whitespaces. Then we'll move on to build a regular expression which will be able to validate Strings containing int values having leading or trailing Whitespaces. Finally we'll move on to build a regular expression which will be able to validate any String representing an int value with leading, trailing and/or embedded Whitespaces.


RegEx for basic int processing: in this case we are assuming that the String doesn't contain any Whitespaces. Find below the code snippet, followed by an explanation of every bit of it.

...

String intString = "-100";

Pattern patternForIntWithoutWSH = Pattern.compile( "((-?+)(\\+{0,1}+))([0-9]+)" );

Matcher mi = patternForIntWithoutWSH.matcher(intString);

boolean result = mi.matches(); // ... returns true

...


A valid int value can contain only two non-digit characters - a mandatory minus sign (-) for negative int values and an optional plus sign (+) for positive int values. Either of these characters, if present, always precedes the actual int value, which is a sequence of one or more digits. We'll divide the task into sub-tasks, form the RegEx for each of them, and then combine those to form the actual RegEx. The sub-tasks in this case are:

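  • Identify occurrences of minus (-) or plus (+) signs: This sub-task can again be broken down into three sub-sub-tasks: (i) to ensure that if a minus sign occurs it occurs only once - the Reg Ex for this will be -?+ (ii) to ensure that a valid positive int has either exactly one plus sign or no sign at all - the Reg Ex for this will be \+{0,1}+ (iii) to combine the two - the Reg Ex in this case will be (-?+)(\+{0,1}+). In both (i) and (ii) the trailing + makes the quantifier possessive. Note that since the two parts are simply placed one after the other, the combined expression accepts a lone minus, a lone plus, or no sign at all, but it also accepts a minus immediately followed by a plus (as in "-+100"). If that must be rejected, a character class such as [-+]? allows at most one sign.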
  • Identify any digit: The Reg Ex will be [0-9]
  • Identify a sequence of one or more digits: The Reg Ex in this case will be [0-9]+ and NOT [0-9]* as the latter will allow an empty String or a String having only -/+ signs
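To see the difference between [0-9]+ and [0-9]* in practice, here is a minimal sketch that compares the two quantifiers behind the same sign part. The class name DigitQuantifierDemo and the helper names strict and loose are only illustrative; they are not part of the original program.

```java
import java.util.regex.Pattern;

class DigitQuantifierDemo {
    // Sign part from the article, followed by ONE or more digits.
    static final Pattern ONE_OR_MORE = Pattern.compile("((-?+)(\\+{0,1}+))([0-9]+)");
    // Same sign part, but ZERO or more digits (the variant the article warns against).
    static final Pattern ZERO_OR_MORE = Pattern.compile("((-?+)(\\+{0,1}+))([0-9]*)");

    static boolean strict(String s) {
        return ONE_OR_MORE.matcher(s).matches();
    }

    static boolean loose(String s) {
        return ZERO_OR_MORE.matcher(s).matches();
    }

    public static void main(String[] args) {
        // "42" passes both; "", "-" and "+" pass only the loose variant.
        for (String s : new String[] {"42", "", "-", "+"}) {
            System.out.println("\"" + s + "\" strict=" + strict(s) + " loose=" + loose(s));
        }
    }
}
```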

Now that we have the regular expressions for all the sub-tasks, we can easily form the complete regular expression which can test Strings having basic int values: ((-?+)(\+{0,1}+))([0-9]+)
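The following sketch runs the basic int expression against a handful of sample Strings. One thing worth knowing is that the sign part accepts a minus followed by a plus, so "-+100" matches. The second pattern, [-+]?[0-9]+, is my assumption of a stricter alternative and is not the expression built in this article; the class and method names are illustrative as well.

```java
import java.util.regex.Pattern;

class BasicIntRegexDemo {
    // The article's pattern: optional minus, optional plus, then one or more digits.
    static final Pattern ARTICLE = Pattern.compile("((-?+)(\\+{0,1}+))([0-9]+)");
    // Assumed alternative (not from the article): at most one sign character.
    static final Pattern SINGLE_SIGN = Pattern.compile("[-+]?[0-9]+");

    static boolean isBasicInt(String s) {
        return ARTICLE.matcher(s).matches();
    }

    static boolean isBasicIntSingleSign(String s) {
        return SINGLE_SIGN.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"-100", "+100", "100", "-+100", "+-100", "1 0", ""};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> article=" + isBasicInt(s)
                    + ", singleSign=" + isBasicIntSingleSign(s));
        }
    }
}
```

Only "-+100" is treated differently by the two patterns; both reject "+-100", "1 0" and the empty String.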


RegEx for int values having Leading and/or Trailing Blanks:


...

String iswlotw = " - 1007 ";

Pattern patternForIntWithWSH = Pattern.compile( "([ ]*)((-?+)(\\+{0,1}+))([ ]*)([0-9]+)([ ]*)" );

Matcher mi1 = patternForIntWithWSH.matcher(iswlotw);

...


It's easy to understand that we can use [ ]* to represent any number of Blank Spaces. These spaces can lead or trail the actual int value. They can also appear before the minus (-) or plus (+) sign and between the sign and the digits. That means we need to add [ ]* to the above Regular Expression before the sign, between the sign and the digits, and after the digits, which gives us the revised regular expression: ([ ]*)((-?+)(\+{0,1}+))([ ]*)([0-9]+)([ ]*)
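Here is a small sketch that exercises the expression with leading and trailing blanks. The class name PaddedIntRegexDemo and the method isPaddedInt are illustrative names, not taken from the original program. Note that blanks between two digits are still rejected at this stage.

```java
import java.util.regex.Pattern;

class PaddedIntRegexDemo {
    // Blanks are allowed before the sign, between sign and digits, and after the digits.
    static final Pattern PADDED_INT =
            Pattern.compile("([ ]*)((-?+)(\\+{0,1}+))([ ]*)([0-9]+)([ ]*)");

    static boolean isPaddedInt(String s) {
        return PADDED_INT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPaddedInt(" - 1007 ")); // true: blanks around sign and digits
        System.out.println(isPaddedInt("  +42"));    // true: leading blanks only
        System.out.println(isPaddedInt("-1 890"));   // false: blank embedded between digits
        System.out.println(isPaddedInt(" - "));      // false: no digit at all
    }
}
```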


RegEx for int values having Embedded Blanks as well:


...

String iscomplete = " - 1 0 ";

Pattern patternForIntWithCompleteWSH = Pattern.compile( "([ ]*)((-?+)(\\+{0,1}+))([ ]*)([0-9]+)([ 0-9 ]*)([ ]*)" );

Matcher mi2 = patternForIntWithCompleteWSH.matcher(iscomplete);

...


The previous regular expression will not work for int values having embedded blanks like "-1 890". To make it capable of identifying such Strings as well, we need to replace the part "([0-9]+)" with "([0-9]+)([ 0-9 ]*)". Here the presence of ([0-9]+) ensures that at least one digit is always present. (The second blank inside [ 0-9 ] is redundant; [ 0-9] behaves the same.) Why can't ([ 0-9 ]+) alone work here? Because it will return true for a String (for example: " - ") having only a minus/plus sign and blank spaces, which of course is incorrect, even though it correctly returns false for these Strings: "-", "+", "". The final regular expression for identifying any int value represented by a String will be: ([ ]*)((-?+)(\+{0,1}+))([ ]*)([0-9]+)([ 0-9 ]*)([ ]*)
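The sketch below puts the final int expression next to the tempting shortcut that uses ([ 0-9 ]+) alone, so the difference on a String like " - " can be checked directly. The class name EmbeddedBlankIntDemo and both method names are illustrative, and the shortcut pattern is written out here as I understand it from the explanation above.

```java
import java.util.regex.Pattern;

class EmbeddedBlankIntDemo {
    // Final int pattern: a mandatory digit, then any mix of blanks and digits.
    static final Pattern FINAL_INT = Pattern.compile(
            "([ ]*)((-?+)(\\+{0,1}+))([ ]*)([0-9]+)([ 0-9 ]*)([ ]*)");
    // Shortcut variant: ([ 0-9 ]+) in place of ([0-9]+)([ 0-9 ]*), so a blank can stand in for the digit.
    static final Pattern SHORTCUT = Pattern.compile(
            "([ ]*)((-?+)(\\+{0,1}+))([ ]*)([ 0-9 ]+)([ ]*)");

    static boolean isInt(String s) {
        return FINAL_INT.matcher(s).matches();
    }

    static boolean isIntShortcut(String s) {
        return SHORTCUT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isInt(" - 1 0 "));      // true
        System.out.println(isInt("-1 890"));       // true
        System.out.println(isInt(" - "));          // false: no digit present
        System.out.println(isIntShortcut(" - "));  // true: the blank satisfies [ 0-9 ]+
        System.out.println(isIntShortcut("-"));    // false
    }
}
```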


Regular Expression for identifying float values: the regular expression for float values should be able to identify zero or one decimal point (.) in addition to a valid sequence of digits before and after that decimal point (if it exists). Another point to note here is that the minus (-) or plus (+) sign, if present, can only precede the digits before the decimal point. Find below the code snippet:

...

String floatString = " -1 007 . 0 5 ";

Pattern patternForFloat = Pattern.compile( "([ ]*)((-?+)(\\+{0,1}+))([ ]*)([0-9]+)([ 0-9 ]*)([ ]*)((\\.)([ ]*[0-9]+))?+([ ]*)([ 0-9 ]*)([ ]*)" );

Matcher mf = patternForFloat.matcher(floatString);

...


The ((\\.)([ ]*[0-9]+))?+ part of the above regular expression identifies a possible decimal point. The character '.' has a special meaning in a regular expression (it matches any character), hence the escape character '\' precedes it; the backslash is doubled here only because the expression is written inside a Java String literal. If the decimal point exists then we need to ensure that at least one digit follows it, and that responsibility is carried by the ([ ]*[0-9]+) sub-part.
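To isolate these two points, here is a minimal sketch using deliberately simplified patterns (digits only, no signs, no blanks before the point). These simplified patterns and the class name DecimalPointEscapeDemo are my own illustration and are not the article's final expression.

```java
import java.util.regex.Pattern;

class DecimalPointEscapeDemo {
    // Escaped dot: only a literal '.' is accepted between the digit runs.
    static final Pattern ESCAPED = Pattern.compile("[0-9]+\\.[0-9]+");
    // Unescaped dot: ANY single character is accepted between the digit runs.
    static final Pattern UNESCAPED = Pattern.compile("[0-9]+.[0-9]+");
    // Digits followed by the article's optional group: a point that must bring at least one digit.
    static final Pattern OPTIONAL_FRACTION = Pattern.compile("[0-9]+((\\.)([ ]*[0-9]+))?+");

    static boolean escaped(String s) { return ESCAPED.matcher(s).matches(); }
    static boolean unescaped(String s) { return UNESCAPED.matcher(s).matches(); }
    static boolean optionalFraction(String s) { return OPTIONAL_FRACTION.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(escaped("12x5"));          // false
        System.out.println(unescaped("12x5"));        // true: '.' matched the 'x'
        System.out.println(optionalFraction("12"));   // true: the whole group is optional
        System.out.println(optionalFraction("12. 5")); // true: blanks allowed after the point
        System.out.println(optionalFraction("12."));  // false: a point needs a digit after it
    }
}
```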


In case of floating value identification we need to ensure that the decimal point, if it exists, is both preceded and followed by at least one digit. Strings like " - 9 . ", " - .8 ", " - . ", etc. should not be allowed. This has been taken care of by placing ([0-9]+) before and after the decimal point check. Before the decimal-point check we need it anyway, as a float processing regular expression should successfully identify an int value as well. Grouping (\\.) and ([0-9]+) together inside the optional part ensures that the mandatory digit is checked only if a decimal point exists. I hope we all can now move to the complete code listing - Java program using Reg Ex to identify int/float >>
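As a quick check of the cases listed above, the sketch below runs the float expression against both valid and invalid Strings. The class name FloatRegexDemo and the method isFloat are illustrative names; the complete program linked above may be organized differently.

```java
import java.util.regex.Pattern;

class FloatRegexDemo {
    // The article's float pattern: int part, then an optional ".digits" part, blanks tolerated.
    static final Pattern FLOAT = Pattern.compile(
            "([ ]*)((-?+)(\\+{0,1}+))([ ]*)([0-9]+)([ 0-9 ]*)([ ]*)"
            + "((\\.)([ ]*[0-9]+))?+([ ]*)([ 0-9 ]*)([ ]*)");

    static boolean isFloat(String s) {
        return FLOAT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {" -1 007 . 0 5 ", "3.14", "42", " - 9 . ", " - .8 ", " - . ", "1.2.3"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + isFloat(s));
        }
    }
}
```

The first three samples match and the remaining four do not. Exponent notation such as "1e5" is outside the scope of this expression and is rejected too.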




6 comments:

Anonymous said...

You have explained the process of building regular expressions very well. I had tried writing this program in past, but I didn't think of these many scenarios. Keep posting many more code-based articles.

Geek said...

Hello Visitors,

Somebody is trying to defame this blog by spamming Sun Java Forums. I've just received couple of Anonymous emails.

I just checked the forums and found that there is a UserID named 'geekexplains'. This is probably because Sun Java Forums doesn't require Email ID verification and hence some guy created this ID maliciously.

Whoever this is, please stop doing that. I believe you'll realize your mistake and stop this. I'll anyway try to contact Sun Support regarding this and this may help them get an Email Verification System in place.

Thanks,
Geek

Anonymous said...

Hey Geek... That's really sad. But, I'll advise you to not be bothered about such things.

GeekExplains blog is doing great and many visitors (including me) are getting immensely benefited. Please don't be worried about such incidents and keep writing nice articles.

People with such intentions can't succeed for sure. I hope the actual guy also realizes his mistake and will stop that.

Keep Rocking!
Priyank Varma

Manish Rungta said...

Hi Geek,

This is a great and useful blog. Please do not get distracted by all these bullshit. Some people have nothing to do but stealing others' thunders.

Please keep up the good work

Yogi-at Meditation! said...

Hi Geek,
It's really sad that some one is trying to affect your earnest efforts for good of java newbies (and seasoned too).

But what I would suggest that you just continue doing good work without getting demoralized by such incidents.

-Yogesh Chaturvedi

Geek said...

@Ryan: thanks for the appreciation. Keep visiting/posting!

@Priyank, @bottledup, @Yogesh: thank you all for all the kind words. I've left that incident behind and focused on the main task - to make this blog better and richer which is not possible without the support of all our visitors. Keep visiting/posting/rocking! (taking a leaf out of Priyank's comment :-))