LEARN HOW TO CLEAN DATA WITH PYTHON

Regular Expressions or RegEx 

A regular expression, or RegEx, is a sequence of characters that defines a search pattern. Regular expressions are used for searching text files for a specific pattern, validating test results, and finding keywords in webpages or emails.

RegEx Ranges 

RegEx ranges are used to specify a range of characters that can be matched. Some of the common regular expression ranges are:  

  • [0-9]: match any digit 
  • [A-Z]: match any uppercase letter 
  • [a-z]: match any lowercase letter 
  • [A-Za-z]: match any lowercase or uppercase letter 
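
As a rough illustration of these ranges, here is a minimal sketch using Python's built-in re module. The sample string is made up for the example.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "Order 66 shipped to Zed"

digits = re.findall(r"[0-9]", text)        # each single digit
uppercase = re.findall(r"[A-Z]", text)     # each uppercase letter
lowercase = re.findall(r"[a-z]", text)     # each lowercase letter
letters = re.findall(r"[A-Za-z]", text)    # each letter of either case

print(digits)      # ['6', '6']
print(uppercase)   # ['O', 'Z']
```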

Wildcards  

Wildcards in RegEx are denoted by the period (.). A wildcard matches any single character (symbol, letter, number, or whitespace) in a piece of text. In Python, the period does not match a newline character unless the re.DOTALL flag is set.

An example of this would be the regular expression ......... (nine periods). It will match the text marsupial, orangutan, or any other 9-character text.
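
A small sketch of the nine-wildcard pattern in Python follows. It uses re.fullmatch so that the whole word has to be exactly nine characters long; the word list is an assumption made for the example.

```python
import re

# Nine periods: each one stands for exactly one character.
pattern = re.compile(r".........")

words = ["marsupial", "orangutan", "monkey"]
matches = [w for w in words if pattern.fullmatch(w)]
print(matches)  # ['marsupial', 'orangutan']

# By default the period does not match a newline character.
dot_matches_newline = re.fullmatch(r".", "\n") is not None
print(dot_matches_newline)  # False
```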

Shorthand Character Classes

Shorthand character classes can help in simplifying the writing of regular expressions. For example, 

  • \w indicates the regex range [A-Za-z0-9_] 
  • \d indicates [0-9] 
  • \W indicates [^A-Za-z0-9_] matching any character not included by \w 
  • \D indicates [^0-9] matching any character not included by \d 
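
The shorthand classes can be tried out with a sketch like the one below. The sample string is invented. One caveat worth knowing: in Python 3, \w and \d also match non-ASCII letters and digits on ordinary strings, so the re.ASCII flag is used here to get exactly the ranges listed above.

```python
import re

# Hypothetical sample text for illustration.
text = "user_42 paid $7"

# re.ASCII restricts \w and \d to [A-Za-z0-9_] and [0-9].
word_runs = re.findall(r"\w+", text, flags=re.ASCII)
digit_runs = re.findall(r"\d+", text, flags=re.ASCII)
non_word = re.findall(r"\W", text, flags=re.ASCII)
non_digit_runs = re.findall(r"\D+", text, flags=re.ASCII)

print(word_runs)       # ['user_42', 'paid', '7']
print(digit_runs)      # ['42', '7']
print(non_word)        # [' ', ' ', '$']
print(non_digit_runs)  # ['user_', ' paid $']
```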

Kleene Star & Kleene Plus  

In regular expressions, the Kleene star (*) indicates that the preceding character can occur 0 or more times. An example of this would be meo*w. It will match mew, meow, meooow, and meoooooooooooow.

The Kleene plus (+) represents that the preceding character can occur 1 or more times. For example, meo+w will match meooow, meoooooooooooow, and meow but not mew. 
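
To see the difference between the two quantifiers, a quick sketch in Python (the word list is chosen for the example):

```python
import re

words = ["mew", "meow", "meooow"]

# * allows zero or more o's, + requires at least one.
star_matches = [w for w in words if re.fullmatch(r"meo*w", w)]
plus_matches = [w for w in words if re.fullmatch(r"meo+w", w)]

print(star_matches)  # ['mew', 'meow', 'meooow']
print(plus_matches)  # ['meow', 'meooow']
```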

Grouping  

Grouping in RegEx is achieved with an open parenthesis ( and a close parenthesis ).

For example, the RegEx I love (baboons|gorillas) will match the text I love baboons and I love gorillas, as the grouping restricts the reach of the | to the text within the parentheses. 
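
The effect of the parentheses is easiest to see by comparing the pattern with and without them. This is only a sketch; the variable names are made up.

```python
import re

grouped = re.compile(r"I love (baboons|gorillas)")
ungrouped = re.compile(r"I love baboons|gorillas")

# With the group, | only chooses between the two animals.
m = grouped.fullmatch("I love gorillas")
captured = m.group(1)
print(captured)  # gorillas

# Without the group, | splits the whole pattern in two:
# either "I love baboons" or just "gorillas".
grouped_alone = grouped.fullmatch("gorillas") is not None
ungrouped_alone = ungrouped.fullmatch("gorillas") is not None
ungrouped_full = ungrouped.fullmatch("I love gorillas") is not None
print(grouped_alone, ungrouped_alone, ungrouped_full)  # False True False
```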

Optional Quantifiers  

Optional quantifiers are represented by a question mark (?) in regular expressions. The question mark means that the preceding character can appear either 0 or 1 time.

For example, the regular expression humou?r will match the texts humour and humor. 
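
A short Python sketch of the optional u, with one deliberately wrong spelling added to the (invented) word list:

```python
import re

pattern = re.compile(r"humou?r")

words = ["humor", "humour", "humouur"]
spellings = [w for w in words if pattern.fullmatch(w)]
print(spellings)  # ['humor', 'humour']
```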

Character Sets  

Character sets in regular expressions are represented by a pair of brackets []. A character set matches any one of the characters contained within the brackets.

An example of this would be the regular expression con[sc]en[sc]us. It will match any of these texts: consensus, concensus, consencus, concencus.
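
The four spellings accepted by con[sc]en[sc]us can be checked with a sketch like this one. The last candidate is made up to show a rejection.

```python
import re

pattern = re.compile(r"con[sc]en[sc]us")

candidates = ["consensus", "concensus", "consencus", "concencus", "consenzus"]
accepted = [w for w in candidates if pattern.fullmatch(w)]
print(accepted)  # ['consensus', 'concensus', 'consencus', 'concencus']
```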

Literals  

Literals are the simplest building blocks of regular expressions. A literal matches exactly the text it spells out.

An example of this would be the regex monkey. The regex monkey will match the text monkey. However, it will also match monkey in the text “The monkeys like to eat apples”. 
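
In Python, re.search looks for a literal anywhere in the text, which is why monkey is also found inside monkeys. A minimal sketch:

```python
import re

text = "The monkeys like to eat apples"

match = re.search(r"monkey", text)
found = match.group()   # the exact text that matched
position = match.start()  # index where the match begins
print(found, position)  # monkey 4
```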

Fixed Quantifiers  

Fixed quantifiers are represented by curly braces {}. The braces contain either the exact number of times, or the range of times, that the preceding character must be repeated.

An example of this would be the regular expression roa{3}r. It will match only the text roaaar. However, the regular expression roa{3,6}r will match roaaar, roaaaar, roaaaaar, and roaaaaaar.
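
Counting the a's by eye is error-prone, so the sketch below builds the test words with a small helper. The helper name roar is hypothetical and exists only for this example.

```python
import re

def roar(n):
    """Build 'ro' + n copies of 'a' + 'r' (helper for this example only)."""
    return "ro" + "a" * n + "r"

# Which counts of 'a' does each pattern accept?
exact = [n for n in range(1, 9) if re.fullmatch(r"roa{3}r", roar(n))]
ranged = [n for n in range(1, 9) if re.fullmatch(r"roa{3,6}r", roar(n))]

print(exact)   # [3]
print(ranged)  # [3, 4, 5, 6]
```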

Alternation  

Alternation in regular expressions is specified by the pipe symbol |. It can help in the matching of either of two subexpressions. 

An example of this would be the regex baboons|gorillas. It will match either the text baboons or the text gorillas.
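
With re.findall, the alternation picks up every occurrence of either word. The sentence used here is invented for the example.

```python
import re

# Hypothetical sample sentence.
text = "The zoo has baboons, lemurs, and gorillas"

animals = re.findall(r"baboons|gorillas", text)
print(animals)  # ['baboons', 'gorillas']
```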

Anchors  

Anchors (hat ^ and dollar sign $) in regular expressions match the start and the end of a string, respectively.

For example, the regex ^Monkeys: my mortal friend$ will match the text Monkeys: my mortal friend. However, it will not match the text Spider Monkeys: my mortal friend or Monkeys: my mortal friend in the wild. The ^ ensures that the matched text starts with Monkeys, and the $ ensures the matched text ends with friend.
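
The three texts from this example can be checked in Python as sketched below. re.search is used on purpose: it would normally find the pattern anywhere in the text, so any rejection comes from the anchors themselves.

```python
import re

pattern = re.compile(r"^Monkeys: my mortal friend$")

candidates = [
    "Monkeys: my mortal friend",
    "Spider Monkeys: my mortal friend",        # extra text before: ^ fails
    "Monkeys: my mortal friend in the wild",   # extra text after: $ fails
]
anchored = [s for s in candidates if pattern.search(s)]
print(anchored)  # ['Monkeys: my mortal friend']
```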