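A remarkable fact: digits 763 to 768 of Pi (decimal places 762 to 767 when the leading 3 is not counted) are 999999. This run is known as the Feynman Point. It is the 66th run of repeated digits in Pi, which this report shows as a proof of concept.
All other runs and the regular expressions used to find them are shown too.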
One π Rex to Heaven: Exploring Patterns in Pi with Regular Expressions
1. One π Rex to Heaven
Start with Regular Expressions 2.1
1.1 From Reg-Ex to Pi-Rex
Pi is an irrational number, so its decimal expansion is infinite and never settles into a repeating cycle.
But there are some strange and amazing patterns to be found (as there would be if
you looked long and hard enough at any sufficiently long, random-looking digit sequence).
Feynman reportedly wanted to memorize Pi up to the 762nd decimal place, so that he
could recite the digits, end with “999999 and so on”, and trick people into thinking
Pi was rational. This was my motivation to find out how many runs of the
same repeated digit occur in Pi, like those below (and I found Hell):
PI EXplore2:
1: 33 2: 88 3: 99 4: 44
5: 99 6: 11 7: 66 8: 44
9: 55 10: 22 11: 111 12: 11
13: 555 14: 44 15: 22 16: 44
CONST C_PI_BIG =
'3.141592653589793238462643383279502884197169399375105820974944592307
8164062862089986280348253421170679821480865132823066470938446095505
8223172535940812848111745028410270193852110555964462294895493038196
4428810975665933446128475648233786783.....
Sequences like this used to be important examples in motivating discussions about
the law of the excluded middle. Statements like: “the full decimal expansion of Pi
contains the sequence 7777”, according to the law of the excluded middle, are either
true or false.
And I did the proof of concept with one single RegEx, (\d)\1+, and the above const
(in the Appendix we solve the \r\n newline problem of line breaks between the digits!):
  writeln('PI Explore: '+getMatchString('(?m)(\d|\s)\1+',
                                          C_PI_BIG));
You can use a back-reference, \1+. Back-references match the same text as
previously matched by a capturing group, so (\d)\1+ matches a digit followed by one or more copies of that same digit.
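As a minimal sketch of the same back-reference idea outside maXbox, the following Python snippet lists the runs in the first two lines of the constant above. Python's `re` module is swapped in for TRegExpr here, and the names `PI_START` and `find_runs` are invented for this illustration.

```python
import re

# First two lines of the C_PI_BIG constant above, joined without a break
PI_START = (
    "3.141592653589793238462643383279502884197169399375105820974944592307"
    "8164062862089986280348253421170679821480865132823066470938446095505"
)

def find_runs(text):
    """Return every run of two or more identical digits, in order of appearance."""
    # (\d) captures one digit; \1+ demands one or more repeats of that same digit
    return [m.group(0) for m in re.finditer(r"(\d)\1+", text)]

runs = find_runs(PI_START)
for number, run in enumerate(runs, start=1):
    print(f"{number}: {run}")
```

The first nine entries agree with the PI EXplore2 listing above (33, 88, 99, 44, 99, 11, 66, 44, 55).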
Regular expressions are the main way many tools match patterns within strings.
For example, finding pieces of text within a larger doc, or finding a restriction site
within a larger sequence. This report illustrates what a RegEx is and what you can do
to find, match, compare or replace text of documents or code.
It is amazing how much a single function can do, especially if you have never
worked with a RegEx before. The back-reference \1 (backslash one) references the first
capturing group: \1 matches the exact same digit that was matched by the
first capturing group. The fact is that digits 763-768 of Pi are 999999 (the Feynman Point).
And this is the function; it simply returns all runs in one string:
function getMatchString2(arex, atext: string): string;
var it: integer;
begin
  result:= CRLF;                       // start the listing on a new line
  with TRegExpr.Create do
  try
    Expression:= arex;
    it:= 0;
    { Match format search...}
    if Exec(atext) then
      repeat
        Inc(it);
        result:= result + Format(#09'%d: %-12s',[it, Match[0]]);
        if it mod 5 = 0 then
          result:= result + #13#10;    // five matches per row
      until not ExecNext;
  finally
    Free;
  end;
  WriteLn('Done REX2 - Hit NOthing to exit');
end;
And then I did a sort with a Stringlist collection to get an ordered result:
PI Explore2 (found 600 groups in Pi file with 7000 decimals) :
01: 33 02: 88 03: 99 04: 44 05: 99
06: 11 07: 66 08: 44 09: 55 10: 22
11: 111 12: 11 13: 555 14: 44 15: 22
16: 44 17: 88 18: 66 19: 33 20: 44
21: 33 22: 66 23: 66 24: 33 25: 00
26: 66 27: 55 28: 88 29: 88 30: 00
31: 11 32: 33 33: 88 34: 66 35: 11
36: 33 37: 11 38: 11 39: 11 40: 44
41: 99 42: 88 43: 22 44: 11 45: 33
46: 33 47: 44 48: 66 49: 22 50: 77
51: 66 52: 000 53: 77 54: 77 55: 77
56: 44 57: 22 58: 22 59: 99 60: 11
61: 44 62: 77 63: 77 64: 99 65: 11
66: 999999 67: 99 68: 44 69: 55 70: 22
As I said, we found hell: at position 66 you see the digit 9 six times, so the Feynman
Point 999999 is found at group position 66!
Still, I decided to look more closely at this long run that occurs (relatively)
early in the Pi expansion. The run at position 66 in the listing
above corresponds to the sequence 999999 beginning at digit 763 and
ending at digit 768 (counting the leading 3 as digit 1). This was observed in a set of the first 1000 digits of Pi.
96086403441815981362977477130996051870721134999999837297804995105973173281609631
Even with 10000 digits, the 999999 sequence at digit 763 was still the longest
run of repeated digits. The runner-up sequences in these first 10000 digits
are two instances of 2222, a single run of 8888, and four distinct occurrences of
7777.
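To illustrate how the longest run can be picked out, here is a small sketch in Python (used as a stand-in for the maXbox script) that works on the 80-digit excerpt shown above. The names `EXCERPT`, `matches` and `longest` are chosen for this example only.

```python
import re

# The 80-digit excerpt shown above (it contains the Feynman Point)
EXCERPT = ("96086403441815981362977477130996051870721134999999"
           "837297804995105973173281609631")

# Collect all runs of a repeated digit, keeping the match objects for their offsets
matches = list(re.finditer(r"(\d)\1+", EXCERPT))

# The longest run is simply the match with the greatest length
longest = max(matches, key=lambda m: len(m.group(0)))
print(longest.group(0), "starts at offset", longest.start(), "of the excerpt")
```

The runs found in the excerpt correspond to entries 61 to 67 of the listing above, with 999999 as the longest one.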
-------------PI sort group2x - 1 to 1000 decimals ----------------------
3 x 00
2 x 000
12 x 11
2 x 111
8 x 22
10 x 33
11 x 44
4 x 55
1 x 555
11 x 66
9 x 77
9 x 88
6 x 99
1 x 999999
Done REX2 - Hit NOthing to exit
Now that the digits of Pi have been calculated to astounding lengths, and that
mathematics exists to uncover the properties of the Pi digit sequence, the expansion of
Pi is no longer an ideal candidate for constructivist thought experiments.
The calculation of Pi also figures in the Season 2 Star Trek episode "Wolf in the Fold"
(1967), in which Captain Kirk and Mr. Spock force an evil entity (composed of pure
energy and which feeds on fear) out of the starship Enterprise's mainframe computer
by commanding the computer to "compute to the last digit the value of Pi," thus
sending the computer into an infinite loop.
function getMatchStringSortGroup2(arex, atext: string): string;
var s: TStringList; gcnt, it: integer;
begin
  s:= TStringList.Create;
  s.Sorted:= true;
  s.Duplicates:= dupAccept;        // keep repeated runs so they can be counted
  with TRegExpr.Create do
  try
    Expression:= arex;
    { Match format search...}
    if Exec(atext) then
      repeat
        s.Add(Match[0]);
      until not ExecNext;
    result:= s.Text;
    gcnt:= 0;
    // walk the sorted list and print one "count x run" line per distinct run
    for it:= 0 to s.Count-2 do begin
      if s[it] = s[it+1] then Inc(gcnt) else begin
        writeln(itoa(gcnt+1)+' x '+s[it]);
        gcnt:= 0;
      end;
      if it = s.Count-2 then
        writeln(itoa(gcnt+1)+' x '+s[it+1]);
    end;
  finally
    Free;
    s.Free;
  end;
  WriteLn('Done REX2 - Hit NOthing to exit');
end;
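The same grouping can be sketched more compactly with a counting dictionary. The snippet below is a Python illustration, not the maXbox code: `collections.Counter` replaces the sorted string list, and the input is only the short excerpt containing the Feynman Point, so the counts are much smaller than in the statistic above.

```python
import re
from collections import Counter

# Short excerpt of Pi around the Feynman Point (taken from the text above)
EXCERPT = ("96086403441815981362977477130996051870721134999999"
           "837297804995105973173281609631")

# Count how often each distinct run occurs
counts = Counter(m.group(0) for m in re.finditer(r"(\d)\1+", EXCERPT))

# Print in the same "count x run" layout, sorted by the run text
for run in sorted(counts):
    print(f"{counts[run]} x {run}")
```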
Impressive or not? One can argue that a function has the advantage that it can be extended with
objects and methods, but the same goes for a RegEx object (and there are
many):
//replace StripTags with null
with TStRegex.Create(NIL) do begin
//inputfile:= '<p>This is text.<br/> This is line 2</p>';
inputfile:= exepath+'geomapX.txt';
matchpattern.clear;
matchpattern.add('<[^>]*>'); //find all tags and strip it!
replacepattern.add('z'); //Null expression!
outputfile:= exepath+'geomapXoutreplace2.txt';
Execute;
Free
end;
As you will see regular expressions are composed of characters, character classes,
meta-characters, groups, quantifiers, and assertions.
So what’s a Regular Expression?
A RegEx describes a search pattern of text typically made up from special characters
called meta-characters:
You can test whether a string matches an expression pattern
You can use a RegEx to search/replace characters in a string
It’s very powerful and reusable, but a bit tough to read
Let’s jump briefly into history, to the rise of RegEx with Perl and PCRE. Perl is a horribly
flawed and very useful scripting language, based on UNIX shell scripting and C, that
helped lead to many other better languages. Perl was and is also excellent for
string/file/text processing because it built regular expressions directly into the
language as a first-class data type.
1.2 RegEx Tutor out of the Box
As you already know, the tool is split up into the toolbar across the top, the editor or
code part in the centre and the output window at the bottom or the interface part on
the right. You can change that to your own style in the menu /View.
In maXbox you execute the RegEx as a script; the libraries and units are already
built in. So far so good, now we’ll open the example:
646_pi_evil2.TXT
http://sourceforge.net/projects/maxbox/files/Examples/13_General/646_pi_evil2.TXT/download
Now let’s take a look at the code of this first part of the project. Our first line is
  Program PI_REX;
which gives the program its name.
This example includes (self-contained in maXbox) two objects from the classes
TRegExpr and TPerlRegEx; the second one comes from the well-known PCRE library.
TPerlRegEx is a VCL wrapper around the open source PCRE library, which
implements Perl-Compatible Regular Expressions.
This version of TPerlRegEx is compatible with the TPerlRegEx class (PCRE 7.9) in the
RegularExpressionsCore unit in Delphi XE. In fact, the unit in Delphi XE and
maXbox3 is derived from the version of TPerlRegEx that we are using now.
Let’s do a second RegEx. We want to check whether a name is a valid Pascal name, like a
syntax checker does. We use a ready-made function in the box:
  if ExecRegExpr('^[a-zA-Z_][a-zA-Z0-9_].*','pascal_name_kon')
    then writeln('pascal name valid') else writeln('pascal name invalid');
This is a useful global function:
  function ExecRegExpr (const ARegExpr, AInputStr: string): boolean;
It returns true if the string AInputStr matches the regular expression ARegExpr, and it
raises an exception if ARegExpr contains a syntax error.
Now let’s analyse our first RegEx step by step '^[a-zA-Z_][a-zA-Z0-9_].*':
^ matches the beginning of a line; $ the end
[a-z] matches any of the twenty-six small letters from 'a' to 'z'
[a-zA-Z_] matches any letter or the underscore
[a-zA-Z0-9_] matches any letter, digit or the underscore
. (a dot) matches any character except \n
* means zero or more occurrences (repetition)
[ ] groups characters into a character set
Note that the trailing .* accepts any characters after the first two, so this pattern only checks how the name starts. A stricter check for a whole identifier is '^[a-zA-Z_][a-zA-Z0-9_]*$'.
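The difference between the pattern analysed above and a stricter, fully anchored variant can be tried out with a short sketch. This uses Python's `re` module instead of ExecRegExpr; the names `LOOSE`, `STRICT` and `is_identifier` and the sample inputs are made up for the illustration.

```python
import re

LOOSE = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_].*")    # the pattern analysed above
STRICT = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")   # every character must be legal

def is_identifier(name):
    """True if name consists only of letters, digits and underscores
    and does not start with a digit."""
    return STRICT.match(name) is not None

for candidate in ["pascal_name_kon", "x", "a1 + b", "9lives"]:
    loose_ok = LOOSE.match(candidate) is not None
    print(f"{candidate!r}: loose={loose_ok} strict={is_identifier(candidate)}")
```

The loose pattern rejects the one-letter name "x" (it demands two characters) and accepts "a1 + b" (the .* swallows the rest), while the strict pattern does the opposite in both cases.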
A lot of rules for the beginning, and they look ugly for novices, but really they are very
simple (well, usually simple ;)), handy and powerful tool too. You can validate e-mail
addresses; extract phone numbers or ZIP codes from web-pages or documents,
search for complex patterns in log files and all you can imagine! Rules (templates)
can be changed without your program recompilation! This can be especially useful
for user input validation in DBMS and web projects.
Any item of a regular expression may be followed by another type of meta-character,
called an iterator. Using these characters you can specify the number of occurrences of
the preceding item. Inside [], however, most meta-characters act as normal characters:
/what[.!*?]*/ matches "what", "what.", "what!", "what?**!", ..
So a character class is a way of matching 1 character in the string being searched to
any of a number of characters in the search pattern.
Character classes are defined using square brackets. Thus [135] matches any of 1,
3, or 5. A range of characters (based on ASCII order) can be used in a character
class: [0-7] matches any digit between 0 and 7, and [a-z] matches any small (but
not capital) letter.
Note that the hyphen is being used as a meta-character here. To match a literal
hyphen in a character class, the simplest way is to make it the first character. So [-135] matches any
of -, 1, 3, or 5. [-0-9] matches any digit or the hyphen.
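The character classes just described behave the same way in most engines. As a quick check, here is a sketch in Python (not maXbox) with made-up sample strings:

```python
import re

# [135] matches exactly one of the listed characters
odd_small = re.findall(r"[135]", "0123456789")

# [0-7] is a range; 8 and 9 fall outside it
octal = re.findall(r"[0-7]", "6789")

# A leading hyphen is literal, so [-0-9]+ matches digits and hyphens
phone = re.findall(r"[-0-9]+", "tel: 555-0199 x")

print(odd_small, octal, phone)
```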
What if we want to define a certain place? An assertion is a statement about the
position of the match pattern within a string. The most common assertions are “^”,
which signifies the beginning of a string, and “$”, which signifies the end of the string.
For example, to search for all empty lines, use ‘^$’.
This is how we can check the format of a port number (a colon followed by one to five digits) with ^ and $:
  if ExecRegExpr('^(:\d\d?\d?\d?\d?)$',':80009')
    then writeln('regex port true') else writeln('regex port false');
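A RegEx like this checks the format only: ':80009' has the right shape although port numbers end at 65535. A minimal sketch of combining the pattern with a numeric range check follows, written in Python as a stand-in for the maXbox call; `PORT_RE` and `is_valid_port` are hypothetical names, and treating 1 to 65535 as the valid range is an assumption of this example.

```python
import re

# A colon followed by one to five digits, anchored at both ends
PORT_RE = re.compile(r"^:(\d{1,5})$")

def is_valid_port(text):
    """Check the ':digits' format first, then the numeric range 1..65535."""
    m = PORT_RE.match(text)
    if m is None:
        return False
    return 1 <= int(m.group(1)) <= 65535

for sample in [":80", ":8080", ":80009", ":123456", "80"]:
    print(sample, is_valid_port(sample))
```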
There are 3 main operators that use regular expressions:
1. Matching, which returns TRUE if a match is found and FALSE if no match is
found
2. Substitution, which substitutes one pattern of characters for another within a
string
3. Split, which separates a string into a series of sub-strings
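The three operations can be sketched side by side. The example uses Python's `re` functions `search`, `sub` and `split` as stand-ins for the corresponding maXbox routines; the sample line is invented.

```python
import re

line = "2024-01-15,  error;disk full"

# 1. Matching: is there a date of the form YYYY-MM-DD?
found = re.search(r"\d{4}-\d{2}-\d{2}", line) is not None

# 2. Substitution: collapse every run of whitespace into one blank
replaced = re.sub(r"\s+", " ", line)

# 3. Split: cut at commas or semicolons, swallowing blanks that follow
parts = re.split(r"[,;]\s*", line)

print(found)
print(replaced)
print(parts)
```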
If you want to match a certain number of repeats of a group of characters, you can
group the characters within parentheses. For example, /(cat){3}/ matches 3 reps of
“cat” in a row: “catcatcat”. However, /cat{3}/ matches “ca” followed by 3 t’s: “cattt”.
And things go on. To negate a character class, that is, to match any character
EXCEPT what is in the class, use the caret ^ as the first symbol in the class. [^0-9]
matches any character that isn’t a digit. [^-0-9] matches any character that isn’t a
hyphen or a digit.
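Both points, grouping before a quantifier and negating a class, can be verified with a few lines. This is a sketch in Python rather than maXbox, with sample strings chosen for the illustration:

```python
import re

# The quantifier {3} applies to the whole group (cat) ...
group_ok = re.fullmatch(r"(cat){3}", "catcatcat") is not None
# ... but without parentheses only to the single character t
single_ok = re.fullmatch(r"cat{3}", "cattt") is not None
single_fail = re.fullmatch(r"cat{3}", "catcatcat") is None

# Negated classes: everything except digits, then except digits and hyphen
no_digits = re.findall(r"[^0-9]+", "ab12cd-3")
no_digits_or_hyphen = re.findall(r"[^-0-9]+", "ab12cd-3")

print(group_ok, single_ok, single_fail)
print(no_digits, no_digits_or_hyphen)
```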
Now some time to reflect:
RE Metacharacter    Matches…
^                   beginning of line
$                   end of line
\char               escapes the meaning of the char following it
[^ ]                one character not in the set
\<                  beginning-of-word anchor
\>                  end-of-word anchor
( ) or \( \)        tags matched characters to be used later (max = 9)
| or \|             or grouping
x{m}                repetition of character x, m times (m = integer)
x{m,}               repetition of character x, at least m times
x{m,n}              repetition of character x, between m and n times
2. Overview of Matches
You can specify a series of alternatives for a pattern using "|" to separate them, so
that fee|fie|foe will match any of "fee", "fie", or "foe" in the target string (as would
f(e|i|o)e). The first alternative includes everything from the last pattern delimiter ("(",
"[", or the beginning of the pattern) up to the first "|", and the last alternative
contains everything from the last "|" to the next pattern delimiter.
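How far an alternative reaches is easy to misjudge, so a small experiment helps. The following sketch uses Python's `re` module in place of the maXbox functions; the gray/grey strings are invented for the example.

```python
import re

words = "fee fie foe fum"

# Both spellings of the alternation find the same three words
plain = re.findall(r"fee|fie|foe", words)
grouped = re.findall(r"f(?:e|i|o)e", words)   # (?: ) groups without capturing

# Without parentheses the alternatives are "gray" and "grey cat"
unparenthesized = re.fullmatch(r"gray|grey cat", "gray") is not None
# With parentheses only the colour word alternates
parenthesized = re.fullmatch(r"(gray|grey) cat", "gray") is not None

print(plain, grouped, unparenthesized, parenthesized)
```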
So let's test the knowledge:
  Writeln(ReplaceRegExpr('<[^>]*>',
          '<p>This is text.<br/> This is line 2</p>','', True))
The RegEx is: '<[^>]*>'
[^>] Matches any character except the closing bracket >.
^ As the first character inside [ ], negates the class (outside a class it matches the beginning of a line).
* Matches zero or more occurrences; equivalent to {0,}.
The non-greedy RegEx '<.*?>' gives the same result here, as long as no tag contains a line break that the dot does not match.
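Whether '<[^>]*>' and '<.*?>' are interchangeable can be tested directly. The sketch below uses Python's `re.sub` as a stand-in for ReplaceRegExpr; the second input, a tag with a line break inside, is a made-up case where the two patterns differ, because in Python the dot does not match a newline by default.

```python
import re

html = "<p>This is text.<br/> This is line 2</p>"

# On single-line tags both patterns strip the same text
negated = re.sub(r"<[^>]*>", "", html)
lazy = re.sub(r"<.*?>", "", html)

# A tag that contains a line break separates the two patterns
multi = "<a\nhref='x'>link</a>"
negated_multi = re.sub(r"<[^>]*>", "", multi)   # [^>] also matches the newline
lazy_multi = re.sub(r"<.*?>", "", multi)        # the dot stops at the newline

print(negated)
print(repr(negated_multi), repr(lazy_multi))
```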
Because an alternative extends all the way to the nearest pattern delimiter, it's common practice to include alternatives in parentheses, to
minimize confusion about where they start and end.
Next a few examples to see the atoms:
  rex:= '(no)+.*';          //Print all lines containing one or more consecutive
                            //occurrences of the pattern “no”.
  rex:= '.*S(h|u).*';       //Print all lines containing uppercase letter “S”,
                            //followed by either “h” or “u”.
  rex:= '.*\.[^0][^0].*';   //Print all lines containing a period followed by
                            //two characters other than zero.
  rex:= '.*[0-9]{6}\..*';   //At least 6 consecutive digits followed by a period.
Next we want to see how the objects in the box work:
The static versions of the methods are provided for convenience, and should only be
used for one off matches, if you are matching in a loop or repeating the same search
often then you should create an 'instance' of the TRegEx record and use the non static
methods.
The RegEx unit defines TRegEx and TMatch as records. That way you don’t have to
explicitly create and destroy them. Internally, TRegEx uses TPerlRegEx to do the heavy
lifting. TPerlRegEx is a class that needs to be created and destroyed like any other
class. If you look at the TRegEx source code, you’ll notice that it uses an interface to
destroy the TPerlRegEx instance when TRegEx goes out of scope. Interfaces are
reference counted in Delphi, making them usable for automatic memory
management.
One of them is a song finder to get a song list from an mp3 file and play them!
4: Mastering
There are plenty more regular expression tricks and operations, which can be found
in Programming Perl, or, for the truly devoted, Mastering Regular Expressions.
Next we enter part two of some insider information about implementation.
5: Enter a RegEx building box
1.3 RegEx in Delphi, C#, Lazarus and maXbox
The TPerlRegEx class aims at providing any Delphi, Java or C++Builder developer
with the same powerful regular expression capabilities provided by the Perl
programming language, created by Larry Wall.
It is implemented as a wrapper around the open source PCRE library.
The regular expression engine in Delphi XE is PCRE (Perl Compatible Regular
Expression). It's a fast and compliant (with generally accepted RegEx syntax) engine
which has been around for many years. Users of earlier versions of Delphi can use it
with TPerlRegEx, a Delphi class wrapper around it.
TRegEx is a record for convenience with a bunch of methods and static class methods
for matching with regular expressions.
I've always used the RegularExpressionsCore unit rather than the higher level stuff
because the core unit is compatible with the unit that Jan Goyvaerts has provided for
free for years.
For new code written in Delphi XE, you should definitely use the RegEx unit that is
part of Delphi rather than one of the many 3rd party units that may be available. But if
you're dealing with UTF-8 data, use the RegularExpressionsCore unit to avoid
needless UTF-8 to UTF-16 to UTF-8 conversions.
By the way there’s another tool: Compose and analyze RegEx patterns with
RegexBuddy's easy-to-grasp RegEx blocks and intuitive RegEx tree, instead of or in
combination with the traditional RegEx syntax. Developed by the author of the
website http://www.regular-expressions.info/, RegexBuddy makes learning and
using regular expressions easier than ever.
6: The RegexBuddy and the GUI
Conclusion: A regular expression (RegEx or REX for short) is a special text string for
describing a search pattern. You can think of regular expressions as wild-cards on
steroids. You are probably familiar with wild-card notations such as *.txt to find all text
files in a file manager. The RegEx equivalent is .*\.txt$.
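The escaped dot matters in this translation. A short sketch in Python (as a stand-in for the maXbox functions, with invented file names) shows what happens when the backslash is forgotten:

```python
import re

names = ["notes.txt", "notes.txt.bak", "readme.md", "atxt"]

# Escaped dot: the name must really end in ".txt"
escaped = [n for n in names if re.match(r".*\.txt$", n)]

# Unescaped dot: any character may stand before "txt"
unescaped = [n for n in names if re.match(r".*.txt$", n)]

print(escaped)
print(unescaped)
```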
Feedback @ max@kleiner.com
Literature: Kleiner et al., Patterns konkret, 2003, Software & Support
Links of maXbox and RegEx Slide Show:
http://www.softwareschule.ch/maxbox.htm
http://www.softwareschule.ch/download/A_Regex_EKON16.pdf
http://www.regular-expressions.info/
http://regexpstudio.com/tregexpr/help/RegExp_Syntax.html
http://sourceforge.net/projects/maxbox
http://sourceforge.net/projects/maxbox/files/Examples/13_General/547_regexmaster3_BASTA.TXT/download
http://sourceforge.net/projects/maxbox/files/Docu/BASTA_mastering_regex_power.pdf/download
1.4 Appendix
EXAMPLE: Mail Finder and C# PI Code
procedure delphiRegexMailfinder;
var PR: TPerlRegEx; TestString: string;
begin
  // Initialize a test string to include some email addresses. This would normally
  // be your eMail text.
  TestString:= '<one@server.domain.xy>, another@otherserver.xyz';
  PR:= TPerlRegEx.Create;
  try
    PR.RegEx:= '\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b'; // <-- this is the regex used
    PR.Options:= PR.Options + [preCaseLess, preMultiLine];
    PR.Compile;
    PR.Subject:= TestString; // <-- tell the TPerlRegEx where to look for matches
    if PR.Match then begin
      // At this point the first matched eMail is already in MatchedText, so grab it
      WriteLn(PR.MatchedText); // Extract first address (one@server.domain.xy)
      // Let the regex engine look for more matches in a loop:
      while PR.MatchAgain do
        WriteLn(PR.MatchedText); // Extract subsequent addresses (another@otherserver.xyz)
    end;
  finally
    PR.Free;
  end;
  //Readln;
end;
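For comparison, the same mail finder can be sketched in a few lines of Python. This is an illustration with Python's `re` module, not the TPerlRegEx API; `EMAIL_RE` is a name chosen here, and the pattern is the same simple one as above, which is not a complete e-mail validator.

```python
import re

# Same pattern as in the Delphi example; IGNORECASE plays the role of preCaseLess
EMAIL_RE = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE)

text = "<one@server.domain.xy>, another@otherserver.xyz"

# findall returns every non-overlapping match, first address first
addresses = EMAIL_RE.findall(text)
for address in addresses:
    print(address)
```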
C# example
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
namespace ConsoleApplication2
{
class ProgramPI_BIG
{
static void Main(string[] args)
{
string PIvalue =   // verbatim string (@) so that the literal may span several lines
@"3.1415926535897932384626433832795028841971693993751058209749445923078164062862089
986280348253421170679821480865132823066470938446095505822317253594081284811174502
841027019385211055596446229489549303819644288109756659334461284756482337867831652
712019091456485669234603486104543266482133936072602491412737245870066063155881748
815209209628292540917153643678925903600113305305488204665213841469519415116094330
572703657595919530921861173819326117931051185480744623799627495673518857527248912
279381830119491298336733624406566430860213949463952247371907021798609437027705392
171762931767523846748184676694051320005681271452635608277857713427577896091736371
787214684409012249534301465495853710507922796892589235420199561121290219608640344
181598136297747713099605187072113499999983729780499510597317328160963185950244594
553469083026425223082533446850352619311881710100031378387528865875332083814206171
776691473035982534904287554687311595628638823537875937519577818577805321712268066
13001927876611195909216420198";
            // The literal above may contain line breaks, so strip all whitespace first
            PIvalue = Regex.Replace(PIvalue, @"\s+", "");
            Regex regex = new Regex(@"(\d)\1+");
            Match match = regex.Match(PIvalue);
            int i = 1;
            // Print every run once, numbered in order of appearance
            while (match.Success)
            {
                Console.WriteLine("{0} Match: {1}", i, match.Value);
                match = match.NextMatch();
                i++;
            }
}
}
}
1.4.1 PI Finder with multiline file string
If you want a match only when the first two and the last two digits are the same (your examples
suggest this), then the regex ^(\d\d)\d{0,4}\1$ is the appropriate one.
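A quick way to see what this pattern accepts is to run it over a few sample strings. The sketch below uses Python's `re` module as a stand-in; `PATTERN`, `accepted` and the sample strings are invented for this illustration.

```python
import re

# Two captured digits, up to four more digits, then the same two digits again
PATTERN = re.compile(r"^(\d\d)\d{0,4}\1$")

samples = ["1212", "12345612", "12999912", "123456", "121"]
accepted = [s for s in samples if PATTERN.match(s)]
print(accepted)
```

Only the strings that end with the same two digits they start with, and are at most eight digits long, are accepted.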
But when you read a file with line breaks in between, a run can span two lines (as the 8 at
the end of the third line and the 888 at the start of the fourth line below do), so another workaround is needed:
53061422881375850430633217518297986622371721591607
71669254748738986654949450114654062843366393790039
76926567214638530673609657120918076383271664162748
88800786925602902284721040317211860820419000422966
17119637792133757511495950156604963186294726547364
25230817703675159067350235072835405670403867435136
22224771589150495309844489333096340878076932599397
80541934144737744184263129860809988868741326047215
This assumes that, with ReplaceRegExpr(), the whole text has been read into a single
string first (i.e., you're not reading the file line by line) and that the line breaks are
removed before the matches are searched. It doesn't assume the line
separator is always \n, as your code does. At the
minimum you should allow for \r\n and \r as well:
if fileexists(exepath+'examples\pi_numbers7000.txt') then begin
  sr:= ReplaceRegExpr('(?s)(\r\n|\r|\n)+',
          filetoString(exepath+'examples\pi_numbers7000.txt'),'',true);
  //writeln('test pi file with CRLF'+sr)
  writeln('PI EXplore7000: '+getMatchString2('(?m)(\d)\1+',sr));
  writeln('PI EXplore7000: '+CRLF+getMatchStringSortGroup3('(?m)(\d)\1+',sr));
end;
Line breaks can be signalled by different control characters. On UNIX systems, it is
typical to signal a line break with a single ASCII character 10 (line feed). Windows, on the other
hand, still maintains the tradition of combining both carriage return (character 13)
and line feed (character 10).
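The effect of the line breaks can be reproduced with two lines of the digit block above. The following is a sketch in Python, not the maXbox code: `RAW` holds the third and fourth line of that block with Windows line ends (an assumption about the file), and `runs` is a helper name invented here.

```python
import re

# Third and fourth line of the digit block above, with CRLF line ends
RAW = ("76926567214638530673609657120918076383271664162748\r\n"
       "88800786925602902284721040317211860820419000422966\r\n")

def runs(text):
    """All runs of two or more identical digits, in order of appearance."""
    return [m.group(0) for m in re.finditer(r"(\d)\1+", text)]

# Remove every kind of line break (\r\n, \r or \n) before searching
joined = re.sub(r"(\r\n|\r|\n)+", "", RAW)

print(runs(RAW))      # the run of 8s is cut in two by the line break
print(runs(joined))   # after joining, the full run 8888 is found
```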
Pi is an extraordinary number. It's "irrational", meaning it cannot be written as the
ratio of two whole numbers. It's also "transcendental", meaning it is not a
root of any polynomial equation with integer coefficients. 22/7 is a useful simple approximation to Pi,
evaluating to 3.142857 to six decimal places (Pi itself is 3.141593).
PI Explore7000 Decimals:
1: 33 2: 88 3: 99 4: 44 5: 99
6: 11 7: 66 8: 44 9: 55 10: 22
11: 111 12: 11 13: 555 14: 44 15: 22
16: 44 17: 88 18: 66 19: 33 20: 44
545: 55 546: 00 547: 66 548: 77
549: 55 550: 33 551: 11 552: 99
553: 66 554: 33 555: 33 556: 11
557: 77 558: 22 559: 11 560: 22
561: 88 562: 11 563: 00 564: 66
565: 99 566: 55 567: 00 568: 55
569: 22 570: 33 571: 44 572: 55
573: 00 574: 22 575: 99 576: 55
577: 55 578: 44 579: 77 580: 77
581: 111 582: 22 583: 00 584: 666
585: 44 586: 33 587: 888 588: 55
589: 33 590: 33 591: 55 592: 66
593: 333 594: 99 595: 11 596: 22
597: 88 598: 111 599: 33 600: 00
1.4.2 Group Statistic
43 x 00
5 x 000
60 x 11
8 x 111
55 x 22
5 x 222
1 x 2222
61 x 33
3 x 333
51 x 44
11 x 444
53 x 55
5 x 555
54 x 66
6 x 666
65 x 77
3 x 777
4 x 7777
43 x 88
4 x 888
1 x 8888
57 x 99
1 x 999
1 x 999999
Done REX3 - Hit NOthing to exit
PI EXplore7000:
//************************************ Code Finished******************************
1.5 Appendix String RegEx methods
.match(regexp)              returns the first match for this string
                            against the given regular expression;
                            if the global /g flag is used, returns an array
                            of all matches
.replace(regexp, text)      replaces the first occurrence of the regular
                            expression with the given text; if the
                            global /g flag is used, replaces all
                            occurrences
.search(regexp)             returns the first index where the given
                            regular expression occurs
.split(delimiter[,limit])   breaks apart a string into an array of
                            strings using the given regular expression as the
                            delimiter; returns the array of tokens