{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Regular expression (re)\n",
    "\n",
    "Regular expressions are essentially a tiny, highly specialized programming language embedded inside Python and made available through the *re* module. Using *re*, one can specify the rules for the set of possible strings that one wants to match. One can match strings in English sentences, or e-mail addresses, or html files. In short, it provides\n",
    "an extremely powerful way for us to do *string matching*.\n",
    "\n",
    "We first import the *re module*. This module has the following features:\n",
    "- *re* provides regular expression tools for advanced string processing.\n",
    "- We can use *re.search()* to see if a string matches a regular expression. Note that the *Return* is a *True* or *False*.\n",
    "- You can use *re.findall()* to extract portions of a string that match your regular expression. Note that the *Return* a list of strings.\n",
    "\n",
    "**Reference**: https://pymotw.com/3/re/\n",
    "\n",
    "**Created and updated by** John C. S. Lui on August 14, 2020.\n",
    "\n",
    "**Important note:** *If you want to use and modify this notebook file, please acknowledge the author.*"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## Finding patterns in text\n",
    "\n",
    "One common use of *re* is to search for patterns in text. The *search()* function takes the pattern and text to scan, and returns a *Match object* when the pattern is found. If the pattern is not found, search() returns *None*.  Let's look at an example.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# first import the regular expression module\n",
    "import re\n",
    "\n",
    "pattern = 'this'\n",
    "text = 'This is really stupid, because this is nut.'\n",
    "\n",
    "match = re.search(pattern, text)    # scan pattern within the text\n",
    "\n",
    "start = match.start()         # note the position of the starting position\n",
    "end   = match.end()\n",
    "\n",
    "#  Let's look at the 'format' output\n",
    "print('Found \"{}\"\\nin \"{}\"\\nfrom {} to {} (\"{}\").'.format(\n",
    "    match.re.pattern, match.string, start, end, text[start:end]))"
   ]
  },
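  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The example above assumes that the pattern is found. Since *search()* returns *None* when there is no match, calling *start()* on the result would then raise an *AttributeError*. The cell below is a minimal sketch of guarding against that case; the helper name *describe_match* and the sample text are hypothetical and used only for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "def describe_match(pattern, text):\n",
    "    # hypothetical helper: report where the pattern occurs, or that it does not\n",
    "    match = re.search(pattern, text)\n",
    "    if match is None:             # search() found nothing\n",
    "        return 'no match'\n",
    "    return 'found at {}:{}'.format(match.start(), match.end())\n",
    "\n",
    "print(describe_match('this', 'Does this text match?'))\n",
    "print(describe_match('that', 'Does this text match?'))"
   ]
  },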
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compiling expressions\n",
    "\n",
    "Although *re* includes module-level functions for working with regular expressions as text strings, it is more efficient to compile the expressions a **program uses frequently**. The *compile()* function converts an expression string into a *RegexObject*.  Let's study this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import the re\n",
    "import re\n",
    "\n",
    "# Precompile the patterns we want to search, in this case, they are 'this' and 'that'\n",
    "regexes = [\n",
    "    re.compile(p)\n",
    "    for p in ['this', 'that']\n",
    "]\n",
    "\n",
    "text = 'Does this text match the pattern?'   # this is the text we want to search\n",
    "\n",
    "print('Text: {}\\n'.format(text))\n",
    "\n",
    "for regex in regexes:\n",
    "    print('Seeking \"{}\" ->'.format(regex.pattern), end=' ')  # pattern we want to search\n",
    "\n",
    "    if regex.search(text):\n",
    "        print('We found a match !!!!')\n",
    "    else:\n",
    "        print('Sorry, no match')"
   ]
  },
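  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A compiled pattern offers the same operations as the module-level functions, but as methods. The cell below is a small sketch, using a made-up sample sentence, that reuses one compiled pattern for *search()*, *findall()*, and *finditer()*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "regex = re.compile('is')     # compile once, reuse below\n",
    "text = 'This is it'          # made-up sample text\n",
    "\n",
    "found_any = bool(regex.search(text))                 # is there at least one match?\n",
    "all_matches = regex.findall(text)                    # all matching substrings\n",
    "starts = [m.start() for m in regex.finditer(text)]   # where each match starts\n",
    "\n",
    "print(found_any)\n",
    "print(all_matches)\n",
    "print(starts)"
   ]
  },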
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## How to do **multiple** matches?\n",
    "\n",
    "So far, we can only match the *first* instance of the pattern, what if we want to find all instances?  In this case, we use teh *findall()* function, which returns all of the substrings of the input that match the pattern without overlapping.  Let's take a look."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "pattern = 'this'\n",
    "text = 'This is really stupid, because this is nut, and this is crazy.'\n",
    "\n",
    "for match in re.findall(pattern, text):\n",
    "    print('Found \"{}\"'.format(match))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What if we want to find all possible start and end indexes?\n",
    "\n",
    "We can use the *finditer()* function, which returns an **iterator** that produces Match instances instead of the strings returned by *findall()*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's repeat the above program if we want to find the specific positions of each find\n",
    "\n",
    "import re\n",
    "\n",
    "pattern = 'this'\n",
    "text = 'This is really stupid, because this is nut, and this is crazy.'\n",
    "\n",
    "for match in re.finditer(pattern, text):\n",
    "    start = match.start()\n",
    "    end   = match.end()\n",
    "    print('Found \"{}\" at {}:{}'.format(text[start:end], start, end))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pattern syntax\n",
    "\n",
    "Regular expressions support powerful patterns. Patterns can \n",
    "* repeat\n",
    "* be anchored to different logical locations within the input\n",
    "* be expressed in compact forms \n",
    "\n",
    "All these features are used by combining literal text values with meta-characters that are part of the regular expression pattern syntax implemented by *re*.\n",
    "\n",
    "The following are some examples of *re*\n",
    "- ^  &nbsp; &nbsp; : Matches the **beginning** of a line\n",
    "- $  &nbsp; &nbsp; : Matches the **end** of a line\n",
    "- .  &nbsp;&nbsp; &nbsp; : Matches **any** character\n",
    "- \\s &nbsp; &nbsp;: Matches **whitespace**\n",
    "- \\S &nbsp; &nbsp;: **non-whitespace** character\n",
    "- \\*  &nbsp; &nbsp;&nbsp;: **Repeats** a character *zero or more times*\n",
    "- \\*? &nbsp; : **Repeats** a character *zero or more times (non-greedy)\n",
    "- \\+  &nbsp; &nbsp; : **Repeats** a character one or more times\n",
    "- \\+?  &nbsp;: **Repeats** a character one or more times (non-greedy)\n",
    "- [aeiou]  &nbsp; : Matches a single character in this listed **set**\n",
    "- [^XYZ]   &nbsp; : Matches a single character in **not in** the listed **set**\n",
    "- [a-z0-9] &nbsp; : The set of character can include a **range**\n",
    "- (  &nbsp; : Indicates where string **extraction is to start**\n",
    "- )  &nbsp; : Indicates where string **extraction is to end**\n",
    "\n",
    "For complete information, please refer to the documentation.\n",
    "\n",
    "Let's see some examples."
   ]
  },
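  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The examples that follow read from *mbox-short.txt*. For readers who do not have that file at hand, the cell below is a small self-contained sketch that applies a few of the meta-characters listed above to a made-up list of lines. The list *sample_lines* and its addresses are assumptions for illustration, not data taken from the file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# made-up sample lines, for illustration only\n",
    "sample_lines = [\n",
    "    'From: alice@example.com',\n",
    "    'To: bob@example.com',\n",
    "    'Fxxm: not a real header',\n",
    "]\n",
    "\n",
    "# '^' anchors the match to the beginning of the line\n",
    "starts_with_from = [s for s in sample_lines if re.search(r'^From:', s)]\n",
    "\n",
    "# each '.' stands for any single character\n",
    "f_any2_m = [s for s in sample_lines if re.search(r'^F..m:', s)]\n",
    "\n",
    "# one or more non-whitespace characters on both sides of '@'\n",
    "addresses = [a for s in sample_lines for a in re.findall(r'\\S+@\\S+', s)]\n",
    "\n",
    "print(starts_with_from)\n",
    "print(f_any2_m)\n",
    "print(addresses)"
   ]
  },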
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found it, and the line is:  From: John to Lui\n",
      "Found it, and the line is:  From: John to the VC:  \"You are fired !!!\"\n",
      "Found it, and the line is:  From: the VC to John:  \"Are you nut?\"\n",
      "Found it, and the line is:  From: cslui to luics\n",
      "Found it, and the line is:  From:cslui to luics\n",
      "Found it, and the line is:  From: cslui@cse.cuhk.edu.hk to vc@cuhk.edu.hk\n"
     ]
    }
   ],
   "source": [
    "# Example: only match lines that \"start with the string 'From:'\n",
    "\n",
    "import re\n",
    "handle = open('mbox-short.txt')  # open a file\n",
    "for line in handle:      # process each line at a time\n",
    "    line = line.rstrip() # \n",
    "    if re.search('^From:', line):  # match 'From' at the beginning of a line\n",
    "        print('Found it, and the line is: ', line)\n",
    "\n",
    "handle.close()  # close the opened file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found it, and the line is:  From: John to Lui\n",
      "Found it, and the line is:  From: John to the VC:  \"You are fired !!!\"\n",
      "Found it, and the line is:  From: the VC to John:  \"Are you nut?\"\n",
      "Found it, and the line is:  Fxxm: this is nut1\n",
      "Found it, and the line is:  F12m: this is nut2\n",
      "Found it, and the line is:  F!@m: this is nut3\n",
      "Found it, and the line is:  From: cslui to luics\n",
      "Found it, and the line is:  From:cslui to luics\n",
      "Found it, and the line is:  From: cslui@cse.cuhk.edu.hk to vc@cuhk.edu.hk\n"
     ]
    }
   ],
   "source": [
    "# Example: only match lines that \"start with the string 'From:', 'Fxxm:',\n",
    "# 'F12m:', or 'F!@m:'\n",
    "\n",
    "import re\n",
    "handle = open('mbox-short.txt')  # open a file\n",
    "for line in handle:      # process each line at a time\n",
    "    line = line.rstrip() #  strip off white space before the end of line\n",
    "    if re.search('^F..m:', line):  # math 'F..m' at the beginning of a line\n",
    "        print('Found it, and the line is: ', line)\n",
    "handle.close()  # close the opened file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found it, and the line is:  From: cslui@cse.cuhk.edu.hk to vc@cuhk.edu.hk\n"
     ]
    }
   ],
   "source": [
    "# Match lines that start with “From:”, followed by one or more characters (“.+”), \n",
    "# followed by an at-sign (@)”  \n",
    "\n",
    "import re\n",
    "handle = open('mbox-short.txt')  # open a file\n",
    "for line in handle:      # process each line at a time\n",
    "    line = line.rstrip()  \n",
    "    if re.search('^From:.+@', line):  # start with 'From:', with one or more character, and ':'\n",
    "        print('Found it, and the line is: ', line)\n",
    "\n",
    "handle.close()  # close the opened file"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "my_list:  ['cslui@cse.cuhk.edu.hk', 'pclee@cse.cuhk.edu.hk']\n"
     ]
    }
   ],
   "source": [
    "# Extract email addresses\n",
    "\n",
    "import re\n",
    "\n",
    "my_string = 'Hello from cslui@cse.cuhk.edu.hk to pclee@cse.cuhk.edu.hk about the meeting @2PM'\n",
    "\n",
    "# match one or more non-white space, then @, then one or more non-white space\n",
    "my_list = re.findall('\\S+@\\S+', my_string)  \n",
    "\n",
    "print('my_list: ', my_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## String pattern matching library\n",
    "\n",
    "Let's look for substrings that start with a single lowercase letter, or uppercase letter, or a number (\"[a-zA-Z0-9]\"), followed by zero or more non-blank character (\"\\S*\"), followed by an **at-sign** (@), followed by zero or more non-blank character (\"\\S*\"), followed by an upper or lower case letter (\"[a-zA-Z]\").  In other words, we are looking for all **email addresses**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['F!@m']\n",
      "['cslui@cse.cuhk.edu.hk', 'vc@cuhk.edu.hk']\n",
      "['lyu@cse.cuhk.edu.hk']\n",
      "['king@cse.cuhk.edu.hk']\n",
      "['eric@cse.cuhk.edu.hk']\n"
     ]
    }
   ],
   "source": [
    "# Let's examine the program\n",
    "import re\n",
    "handle = open(\"mbox-short.txt\")\n",
    "for line in handle:     # process each line\n",
    "    line = line.rstrip()\n",
    "    x = re.findall('[a-zA-Z0-9]\\S*@\\S*[a-zA-Z]', line)\n",
    "    if len(x)> 0:\n",
    "        print(x)\n",
    "        \n",
    "handle.close()   # close the opened file"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## String Matching Library\n",
    "- The *re* module provides regular expression tools for advanced string processing\n",
    "- You can use *re.search()* to see if a string matches a regular expression, similar to useing *find()* method for strings.  Note that the return is *True* of *False*\n",
    "- You can use *re.findall()* to extract portion of a string that matches your regular expression similar to combination of *find()* and slicing: *var[5:10]*. Note that return is a list of string."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "From: John to Lui\n",
      "John is going From: HK to US\n",
      "From: John to the VC:  \"You are fired !!!\"\n",
      "From: the VC to John:  \"Are you nut?\"\n",
      "From: cslui to luics\n",
      "From:cslui to luics\n",
      "From: cslui@cse.cuhk.edu.hk to vc@cuhk.edu.hk\n",
      "----------------------\n",
      "From: John to Lui\n",
      "John is going From: HK to US\n",
      "From: John to the VC:  \"You are fired !!!\"\n",
      "From: the VC to John:  \"Are you nut?\"\n",
      "From: cslui to luics\n",
      "From:cslui to luics\n",
      "From: cslui@cse.cuhk.edu.hk to vc@cuhk.edu.hk\n"
     ]
    }
   ],
   "source": [
    "# using find() in string vs. re.search() in re\n",
    "\n",
    "handle = open('mbox-short.txt')\n",
    "for line in handle:\n",
    "    line = line.rstrip()\n",
    "    if line.find('From:') >= 0:\n",
    "        print(line)\n",
    "\n",
    "handle.close()\n",
    "print('----------------------')\n",
    "\n",
    "import re\n",
    "\n",
    "\n",
    "handle = open('mbox-short.txt')\n",
    "for line in handle:\n",
    "    line = line.rstrip()\n",
    "    if re.search('From:', line):\n",
    "        print(line)\n",
    "\n",
    "handle.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "From: John to Lui\n",
      "From: John to the VC:  \"You are fired !!!\"\n",
      "From: the VC to John:  \"Are you nut?\"\n",
      "From: cslui to luics\n",
      "From:cslui to luics\n",
      "From: cslui@cse.cuhk.edu.hk to vc@cuhk.edu.hk\n",
      "----------------------\n",
      "From: John to Lui\n",
      "From: John to the VC:  \"You are fired !!!\"\n",
      "From: the VC to John:  \"Are you nut?\"\n",
      "From: cslui to luics\n",
      "From:cslui to luics\n",
      "From: cslui@cse.cuhk.edu.hk to vc@cuhk.edu.hk\n"
     ]
    }
   ],
   "source": [
    "# using  startwith() in string vs. re.search() in re\n",
    "\n",
    "handle = open('mbox-short.txt')\n",
    "for line in handle:\n",
    "    line = line.rstrip()\n",
    "    if line.startswith('From:'):\n",
    "        print(line)\n",
    "\n",
    "handle.close()\n",
    "print('----------------------')\n",
    "\n",
    "import re\n",
    "\n",
    "handle = open('mbox-short.txt')\n",
    "for line in handle:\n",
    "    line = line.rstrip()\n",
    "    if re.search('^From:', line):\n",
    "        print(line)\n",
    "\n",
    "handle.close()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Found pattern 1 '^X.*:', my_string: X-Sieve: CMU Sieve 2.3\n",
      "Found pattern 2 '^X-\\S+:', my_string: X-Sieve: CMU Sieve 2.3\n",
      "-------------\n",
      "Found pattern 1 '^X.*:', my_string: X-DSPAM-Result: Innocent\n",
      "Found pattern 2 '^X-\\S+:', my_string: X-DSPAM-Result: Innocent\n",
      "-------------\n",
      "Found pattern 1 '^X.*:', my_string: X-Plane is behind schedule: two weeks\n",
      "-------------\n"
     ]
    }
   ],
   "source": [
    "# Using \".\" character to match any character. Use \"*\", the character is \"zero or more times\".\n",
    "# Using \"^S^ is any non-whitespace character\n",
    "# Let's illustrate\n",
    "\n",
    "pattern1 = '^X.*:'     # start with'X\", then zero or more character, and ends with ':'\n",
    "pattern2 = '^X-\\S+:'   # start with 'X-', then one or more non-white space character, ends with ':'\n",
    "\n",
    "s1 = 'X-Sieve: CMU Sieve 2.3'\n",
    "s2 = 'X-DSPAM-Result: Innocent'\n",
    "s3 = 'X-Plane is behind schedule: two weeks'\n",
    "my_list = [s1, s2, s3]\n",
    "\n",
    "import re\n",
    "\n",
    "for my_string in my_list:\n",
    "    if re.search(pattern1, my_string):\n",
    "        print(\"Found pattern 1 '\" + pattern1 + \"', my_string: \" + my_string)\n",
    " \n",
    "    if re.search(pattern2, my_string):\n",
    "        print(\"Found pattern 2 '\" + pattern2 + \"', my_string: \" + my_string)\n",
    "    print('-------------')\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "y =  ['2', '19', '42']\n",
      "y =  []\n",
      "y =  ['A', 'AA']\n"
     ]
    }
   ],
   "source": [
    "# Note that re.search() returns a \"True/False\" dependeing on whether the string matches the re.\n",
    "# If we want the matching strings to be EXTRACTED, we use re.findall()\n",
    "\n",
    "import re\n",
    "x = 'My 2 favorite numbers are 19 and 42'\n",
    "y = re.findall('[0-9]+', x)   # find all substrings that start with 0 to 9\n",
    "print('y = ', y)\n",
    "\n",
    "y = re.findall('[AEIOU]+', x)\n",
    "print('y = ', y)\n",
    "y = re.findall('[AEIOU]+', 'ABC ddkfj xAA')   # find all substrings that has  A, E, I, O, or U\n",
    "print('y = ', y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "y =  ['From: Using the :']\n"
     ]
    }
   ],
   "source": [
    "# The repeat characters \"*\" and \"+\" push outward in both directions (greedy-fashion) \n",
    "# to match the largest possible string.\n",
    "# Let's illustrate\n",
    "\n",
    "import re\n",
    "x = 'From: Using the : characters'     \n",
    "pattern1 = '^F.+:'\n",
    "\n",
    "y = re.findall(pattern1, x)   # do substring search in a greedy fashion\n",
    "print('y = ', y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "y =  ['From:']\n"
     ]
    }
   ],
   "source": [
    "# If you don't want to use the greedy mode, you can add \"?\" character, then thigns will chill out\n",
    "\n",
    "import re\n",
    "x = 'From: Using the : characters'     \n",
    "pattern1 = '^F.+?:'\n",
    "\n",
    "y = re.findall(pattern1, x)   # do substring search in a non-greedy fashion\n",
    "print('y = ', y)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Fine tuning string extraction\n",
    "\n",
    "We can refine the match for *re.findall()* and separately determine which portion of the match is to be extracted by using parentheses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "y =  ['stephen.marquard@uct.ac.za']\n",
      "z =  ['stephen.marquard@uct.ac.za']\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "x = \"From stephen.marquard@uct.ac.za Sat Jan 5 09:15:15 2008\"\n",
    "pattern1 = '\\S+@\\S+'       # match non-whitespace character, and \"@\", and non-whitespace character\n",
    "y = re.findall(pattern1, x)\n",
    "print('y = ', y)\n",
    "\n",
    "pattern2 = '^From.*? (\\S+@\\S+)'   # note the use of \"(\" and \")\", we only want to extract that part\n",
    "z = re.findall(pattern2, x)\n",
    "print('z = ', z)"
   ]
  },
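  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When a pattern contains more than one pair of parentheses, *re.findall()* returns a list of tuples, with one item per group. The cell below is a short sketch that reuses the sample line from the cell above to pull out the user name and the host separately; the variable names are chosen only for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "x = 'From stephen.marquard@uct.ac.za Sat Jan 5 09:15:15 2008'\n",
    "\n",
    "# two groups: the part before '@' and the part after it\n",
    "pairs = re.findall(r'(\\S+)@(\\S+)', x)\n",
    "print(pairs)\n",
    "\n",
    "for user, host in pairs:\n",
    "    print('user =', user, '; host =', host)"
   ]
  },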
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "atpos =  22 ; sppos =  38\n",
      "hostname is:  cse.cuhk.edu.hk\n"
     ]
    }
   ],
   "source": [
    "\n",
    "# Given an email address, we want to find the hostname.\n",
    "# For the following example, we want to find 'cse.cuhk.edu.hk'\n",
    "\n",
    "x = \"From idiotic.professor@cse.cuhk.edu.hk Sat Jan 5 09:15:15 2008\"\n",
    "\n",
    "atpos = x.find('@')   # use string's method to find the position of the first '@'\n",
    "sppos = x.find(' ', atpos)   # find the index of space after the atops index\n",
    "\n",
    "print ('atpos = ', atpos, '; sppos = ', sppos)\n",
    "hostname  = x[atpos+1:sppos]\n",
    "print('hostname is: ', hostname)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "hostname is:  cse.cuhk.edu.hk\n"
     ]
    }
   ],
   "source": [
    "# Sometimes we split a line one way, and then grab one of the pieces of the line \n",
    "# and split that piece again\n",
    "\n",
    "x = \"From idiotic.professor@cse.cuhk.edu.hk Sat Jan 5 09:15:15 2008\"\n",
    "\n",
    "words = x.split()      # find out list of words\n",
    "email = words[1]       # access to teh email \n",
    "pieces = email.split('@')   # find out username and institution\n",
    "print('hostname is: ', pieces[1])\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "hostname is:  ['@cse.cuhk.edu.hk']\n"
     ]
    }
   ],
   "source": [
    "# in re module, we can do the following\n",
    "import re\n",
    "\n",
    "x = \"From idiotic.professor@cse.cuhk.edu.hk Sat Jan 5 09:15:15 2008\"\n",
    "\n",
    "# for pattern, starts with '@', '()' is to extract the non-black characters\n",
    "# '[^ ]' is to match non-blank character and finally, '*' is to match many of them.\n",
    "pattern = '@[^ ]*'    \n",
    "\n",
    "hostname = re.findall('@[^ ]*', x)\n",
    "print('hostname is: ', hostname)"
   ]
  },
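  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If only the first address matters, *re.search()* together with the Match object's *group()* method gives the host name as a plain string rather than a list. The cell below is a minimal sketch on the same sample line; it falls back to *None* when nothing matches."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "x = 'From idiotic.professor@cse.cuhk.edu.hk Sat Jan 5 09:15:15 2008'\n",
    "\n",
    "match = re.search(r'@([^ ]*)', x)    # group 1 is the text right after '@'\n",
    "hostname = match.group(1) if match else None\n",
    "print('hostname is: ', hostname)"
   ]
  },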
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# New Lecture"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "pattern and its description: 'ab' ('a' followed by 'b')\n",
      "\n",
      "text is: 'abbaaabbbbaaaaabaaaabbbbbbbabbbb'\n",
      "\n",
      "pattern: ab found in starting index= 0 ; ending index= 2\n",
      "pattern: ab found in starting index= 5 ; ending index= 7\n",
      "pattern: ab found in starting index= 14 ; ending index= 16\n",
      "pattern: ab found in starting index= 19 ; ending index= 21\n",
      "pattern: ab found in starting index= 27 ; ending index= 29\n",
      "\n"
     ]
    }
   ],
   "source": [
    "import re\n",
    "\n",
    "\"\"\"\n",
    "Given source text and a list of patterns, look for\n",
    "matches for each pattern within the text and print\n",
    "them to stdout.\n",
    "\"\"\"\n",
    "def test_patterns(text, patterns):\n",
    "    # Look for each pattern in the text and print the results\n",
    "    for pattern, desc in patterns:\n",
    "        print(\"pattern and its description: '{}' ({})\\n\".format(pattern, desc))\n",
    "        print(\"text is: '{}'\\n\".format(text))\n",
    "        for match in re.finditer(pattern, text):\n",
    "            s = match.start()    # found beginning index\n",
    "            e = match.end()      # found ending index\n",
    "            print ('pattern:', pattern, 'found in starting index=', s, \";\", 'ending index=', e)\n",
    "        print()\n",
    "    return\n",
    "\n",
    "\n",
    "test_patterns('abbaaabbbbaaaaabaaaabbbbbbbabbbb',\n",
    "               [('ab', \"'a' followed by 'b'\"),])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "pattern and its description: 'ab*' (a followed by zero or more b)\n",
      "\n",
      "text is: 'abbaabbba'\n",
      "\n",
      "pattern: ab* found in starting index= 0 ; ending index= 3\n",
      "pattern: ab* found in starting index= 3 ; ending index= 4\n",
      "pattern: ab* found in starting index= 4 ; ending index= 8\n",
      "pattern: ab* found in starting index= 8 ; ending index= 9\n",
      "\n",
      "pattern and its description: 'ab+' (a followed by one or more b)\n",
      "\n",
      "text is: 'abbaabbba'\n",
      "\n",
      "pattern: ab+ found in starting index= 0 ; ending index= 3\n",
      "pattern: ab+ found in starting index= 4 ; ending index= 8\n",
      "\n",
      "pattern and its description: 'ab?' (a followed by zero or one b)\n",
      "\n",
      "text is: 'abbaabbba'\n",
      "\n",
      "pattern: ab? found in starting index= 0 ; ending index= 2\n",
      "pattern: ab? found in starting index= 3 ; ending index= 4\n",
      "pattern: ab? found in starting index= 4 ; ending index= 6\n",
      "pattern: ab? found in starting index= 8 ; ending index= 9\n",
      "\n",
      "pattern and its description: 'ab{3}' (a followed by three b)\n",
      "\n",
      "text is: 'abbaabbba'\n",
      "\n",
      "pattern: ab{3} found in starting index= 4 ; ending index= 8\n",
      "\n",
      "pattern and its description: 'ab{2,3}' (a followed by two to three b)\n",
      "\n",
      "text is: 'abbaabbba'\n",
      "\n",
      "pattern: ab{2,3} found in starting index= 0 ; ending index= 3\n",
      "pattern: ab{2,3} found in starting index= 4 ; ending index= 8\n",
      "\n"
     ]
    }
   ],
   "source": [
    "# Let's try some regular expression syntax like *, +, ?, {}\n",
    "test_patterns(\n",
    "    'abbaabbba',\n",
    "    [('ab*', 'a followed by zero or more b'),\n",
    "     ('ab+', 'a followed by one or more b'),\n",
    "     ('ab?', 'a followed by zero or one b'),\n",
    "     ('ab{3}', 'a followed by three b'),\n",
    "     ('ab{2,3}', 'a followed by two to three b')],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Greedy and the non-greedy mode of searching\n",
    "\n",
    "When processing a repetition instruction, *re* consumes as much of the input as possible while matching the pattern. This so-called **greedy** behavior and it may result in fewer individual matches, or the matches may include more of the input text than intended. How can we **turn off** greediness behavior? We can achieve this by following the repetition instruction with ?.  Let's illustrate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "pattern and its description: 'ab*?' (a followed by zero or more b)\n",
      "\n",
      "text is: 'abbaabbba'\n",
      "\n",
      "pattern: ab*? found in starting index= 0 ; ending index= 1\n",
      "pattern: ab*? found in starting index= 3 ; ending index= 4\n",
      "pattern: ab*? found in starting index= 4 ; ending index= 5\n",
      "pattern: ab*? found in starting index= 8 ; ending index= 9\n",
      "\n",
      "pattern and its description: 'ab+?' (a followed by one or more b)\n",
      "\n",
      "text is: 'abbaabbba'\n",
      "\n",
      "pattern: ab+? found in starting index= 0 ; ending index= 2\n",
      "pattern: ab+? found in starting index= 4 ; ending index= 6\n",
      "\n",
      "pattern and its description: 'ab??' (a followed by zero or one b)\n",
      "\n",
      "text is: 'abbaabbba'\n",
      "\n",
      "pattern: ab?? found in starting index= 0 ; ending index= 1\n",
      "pattern: ab?? found in starting index= 3 ; ending index= 4\n",
      "pattern: ab?? found in starting index= 4 ; ending index= 5\n",
      "pattern: ab?? found in starting index= 8 ; ending index= 9\n",
      "\n",
      "pattern and its description: 'ab{3}?' (a followed by three b)\n",
      "\n",
      "text is: 'abbaabbba'\n",
      "\n",
      "pattern: ab{3}? found in starting index= 4 ; ending index= 8\n",
      "\n",
      "pattern and its description: 'ab{2,3}?' (a followed by two to three b)\n",
      "\n",
      "text is: 'abbaabbba'\n",
      "\n",
      "pattern: ab{2,3}? found in starting index= 0 ; ending index= 3\n",
      "pattern: ab{2,3}? found in starting index= 4 ; ending index= 7\n",
      "\n"
     ]
    }
   ],
   "source": [
    "\n",
    "test_patterns(\n",
    "    'abbaabbba',\n",
    "    [('ab*?', 'a followed by zero or more b'),\n",
    "     ('ab+?', 'a followed by one or more b'),\n",
    "     ('ab??', 'a followed by zero or one b'),\n",
    "     ('ab{3}?', 'a followed by three b'),\n",
    "     ('ab{2,3}?', 'a followed by two to three b')],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Character sets\n",
    "\n",
    "A *character set* is a group of characters, any one of which can match at that point in the pattern. For example, *[ab]* would match either *a* or *b*.  Let's illustrate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_patterns(\n",
    "    'abbaabbba',\n",
    "    [('[ab]', 'either a or b'),\n",
    "     ('a[ab]+', 'a followed by 1 or more a or b'),\n",
    "     ('a[ab]+?', 'a followed by 1 or more a or b, not greedy')],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Character set as exclusion\n",
    "\n",
    "A character set can also be used to exclude specific characters. The carat (*^*) means to look for characters that are not in the set following the carat.  Let's illustrate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This pattern finds all of the substrings that do not contain \n",
    "# the characters -, ., or a space.\n",
    "\n",
    "test_patterns(\n",
    "    'This is some text -- with punctuation.',\n",
    "    [('[^-. ]+', 'sequences without -, ., or space')],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Characters range\n",
    "\n",
    "As character sets grow larger, typing every character that should (or should not) match becomes tedious. A more compact format using character ranges can be used to define a character set to include all of the contiguous characters between the specified start and stop points."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_patterns(\n",
    "    'This is some text -- with punctuation.',\n",
    "    [('[a-z]+', 'sequences of lowercase letters'),\n",
    "     ('[A-Z]+', 'sequences of uppercase letters'),\n",
    "     ('[a-zA-Z]+', 'sequences of letters of either case'),\n",
    "     ('[A-Z][a-z]+', 'one uppercase followed by lowercase')],\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# As a special case of a character set, the meta-character dot, \n",
    "# or period (.), indicates that the pattern should \n",
    "# match any single character in that position.\n",
    "\n",
    "test_patterns(\n",
    "    'abbaabbba',\n",
    "    [('a.', 'a followed by any one character'),\n",
    "     ('b.', 'b followed by any one character'),\n",
    "     ('a.*b', 'a followed by anything, ending in b'),\n",
    "     ('a.*?b', 'a followed by anything, ending in b')],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Escape codes\n",
    "\n",
    "A more compact representation uses escape codes for several predefined character sets. The escape codes recognized by re are listed in the table below.\n",
    "\n",
    "**Regular Expression Escape Codes**<br>\n",
    "Code\t&nbsp; &nbsp; &nbsp; Meaning<br>\n",
    "\\d\t    &nbsp; &nbsp; &nbsp; a digit<br>\n",
    "\\D\t    &nbsp; &nbsp; &nbsp; a non-digit<br>\n",
    "\\s\t    &nbsp; &nbsp; &nbsp; whitespace (tab, space, newline, etc.)<br>\n",
    "\\S\t    &nbsp; &nbsp; &nbsp; non-whitespace<br>\n",
    "\\w\t    &nbsp; &nbsp; &nbsp; alphanumeric<br>\n",
    "\\W\t    &nbsp; &nbsp; &nbsp; non-alphanumeric<br>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_patterns(\n",
    "    'A prime #1 example!',\n",
    "    [(r'\\d+', 'sequence of digits'),\n",
    "     (r'\\D+', 'sequence of non-digits'),\n",
    "     (r'\\s+', 'sequence of whitespace'),\n",
    "     (r'\\S+', 'sequence of non-whitespace'),\n",
    "     (r'\\w+', 'alphanumeric characters'),\n",
    "     (r'\\W+', 'non-alphanumeric')],\n",
    ")"
   ]
  },
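  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A small sketch of the escape codes on a made-up sentence (the sample text is an assumption for illustration). It also shows that *\\w* treats the underscore as a word character, so *user_42* is returned as a single match."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Hypothetical sample text, used only for illustration.\n",
    "text = 'Order 66 shipped to user_42 on 2020-08-14.'\n",
    "\n",
    "# \\d+ extracts every run of digits.\n",
    "print(re.findall(r'\\d+', text))\n",
    "# ['66', '42', '2020', '08', '14']\n",
    "\n",
    "# \\w+ extracts runs of letters, digits, and underscores.\n",
    "print(re.findall(r'\\w+', text))\n",
    "# ['Order', '66', 'shipped', 'to', 'user_42', 'on', '2020', '08', '14']"
   ]
  },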
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Anchoring\n",
    "\n",
    "In addition to describing the content of a pattern to match, the relative location can be specified in the input text where the pattern should appear by using anchoring instructions. The table below lists valid anchoring codes.\n",
    "\n",
    "**Code**\t&nbsp; &nbsp; &nbsp; **Meaning** <br>\n",
    "^\t&nbsp; &nbsp; &nbsp; start of string, or line <br>\n",
    "$\t&nbsp; &nbsp; &nbsp; end of string, or line <br>\n",
    "\\A\t&nbsp; &nbsp; &nbsp; start of string <br>\n",
    "\\Z\t&nbsp; &nbsp; &nbsp; end of string <br>\n",
    "\\b\t&nbsp; &nbsp; &nbsp; empty string at the beginning or end of a word <br>\n",
    "\\B\t&nbsp; &nbsp; &nbsp;empty string not at the beginning or end of a word <br>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_patterns(\n",
    "    'This is some text -- with punctuation.',\n",
    "    [(r'^\\w+', 'word at start of string'),\n",
    "     (r'\\A\\w+', 'word at start of string'),\n",
    "     (r'\\w+\\S*$', 'word near end of string'),\n",
    "     (r'\\w+\\S*\\Z', 'word near end of string'),\n",
    "     (r'\\w*t\\w*', 'word containing t'),\n",
    "     (r'\\bt\\w+', 't at start of word'),\n",
    "     (r'\\w+t\\b', 't at end of word'),\n",
    "     (r'\\Bt\\B', 't, not start or end of word')],\n",
    ")\n"
   ]
  },
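  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The difference between *^* and *\\A* only shows up on multi-line input. The sketch below assumes a two-line sample string and uses the *re.MULTILINE* flag, neither of which appears in the example above. The last line illustrates *\\b* as a whole-word test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Hypothetical two-line sample text, used only for illustration.\n",
    "text = 'first line\\nsecond line'\n",
    "\n",
    "# Without flags, ^ matches only at the start of the string.\n",
    "print(re.findall(r'^\\w+', text))                  # ['first']\n",
    "\n",
    "# With re.MULTILINE, ^ also matches at the start of each line.\n",
    "print(re.findall(r'^\\w+', text, re.MULTILINE))    # ['first', 'second']\n",
    "\n",
    "# \\A always means the start of the string, regardless of flags.\n",
    "print(re.findall(r'\\A\\w+', text, re.MULTILINE))  # ['first']\n",
    "\n",
    "# \\b on both sides matches 'cat' only as a whole word.\n",
    "print(re.findall(r'\\bcat\\b', 'cat concat cats cat.'))  # ['cat', 'cat']"
   ]
  },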
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Constraining the search"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In situations where it is known in advance that only a subset of the full input should be searched, the regular expression match can be further constrained by telling re to limit the search range. For example, if the pattern must appear at the front of the input, then using *match()* instead of *search()* will anchor the search without having to explicitly include an anchor in the search pattern.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "text = 'This is some text -- with punctuation.'\n",
    "pattern = 'is'\n",
    "\n",
    "print('Text   :', text)\n",
    "print('Pattern:', pattern)\n",
    "\n",
    "m = re.match(pattern, text)\n",
    "print('Match  :', m)\n",
    "s = re.search(pattern, text)\n",
    "print('Search :', s)"
   ]
  },
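  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Two related ways to constrain a search are sketched below. Both go beyond the example above and are included as assumptions about what a reader may want next: *re.fullmatch()* requires the pattern to cover the entire string, and the *search()* method of a compiled pattern accepts a start position."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "text = 'This is some text -- with punctuation.'\n",
    "\n",
    "# match() only succeeds at position 0; search() scans the whole string.\n",
    "print(re.match('is', text))              # None\n",
    "print(re.search('is', text).span())      # (2, 4)\n",
    "\n",
    "# fullmatch() requires the pattern to cover the entire string.\n",
    "print(re.fullmatch('This', text))        # None\n",
    "print(re.fullmatch(r'This.*', text) is not None)   # True\n",
    "\n",
    "# A compiled pattern can start searching from a given position.\n",
    "compiled = re.compile('is')\n",
    "print(compiled.search(text, 5).span())   # (5, 7)"
   ]
  },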
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}