2. Introduction
• Having got a string in your system, how do you manipulate it?
• Strings are fundamental forms of data representation.
• Often obtained from text-files and user input.
• Most strings are not in an easily managed form.
• The process of parsing is used to render raw data into more
refined forms.
3. Parsing
• There are many reasons why we may wish to parse data.
• Information comes in as a string – we want it in an array.
• Information comes in as lists of numbers stored as strings – we want
them in objects.
• We are rarely so lucky as to be able to instantly manipulate
data that comes in to the system.
4. Data Representation
• The single most important thing in designing a program is
to represent your data correctly.
• If you get this right, everything is easier as a result.
• If you get it wrong, everything is more difficult.
• Before you ever write a line of code, consider how data must
be represented in the system.
• What variables, objects and arrays are you going to use?
5. Data Representation
• Consider how you are going to need to manipulate the data in
the system.
• Are you going to need to be able to search through things?
• Are you going to need to process each value in turn?
• Are you going to need to represent relationships between things?
• An easily manipulated data structure is worth its weight in
gold.
6. Parsing
• Parsing is the process of turning difficult-to-manipulate data
into a more useful format.
• Break strings up into all their constituent parts
• Convert from multiple arrays into an array of objects
• Important first step before more complex processing.
• Various standard techniques exist to facilitate this.
7. Common Parsing Tasks
• Tokenization
• Turn a string into several smaller strings (tokens) by splitting
it on a delimiter
• Object processing
• Breaking multiple data fields out of a single string and configuring
an object
• Data conversion
• Bringing data elements into some common format
• Often necessary to combine different processes.
8. Tokenization
• Tokenization is the process of splitting up strings.
• Based on the idea of a delimiter.
• Strings that have a common, delimited structure are amenable
to tokenization.
• 10,20,30,40
• Jim,Jake,Jane,Johana
• Strings are broken up based on the delimiter and the result is
an array of strings.
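
A minimal sketch of delimiter-based tokenization, assuming a single-character delimiter. The function name tokenize is hypothetical, and std::vector stands in for the array of strings so the result can grow as needed.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split `text` on `delimiter` and return the pieces in order.
// The name `tokenize` is made up for this example.
std::vector<std::string> tokenize(const std::string& text, char delimiter) {
    std::vector<std::string> tokens;
    std::istringstream stream(text);
    std::string token;
    // std::getline reads up to the next delimiter and discards it.
    while (std::getline(stream, token, delimiter)) {
        tokens.push_back(token);
    }
    return tokens;
}
```

Calling tokenize("10,20,30,40", ',') should give four strings, one per number, which are still strings at this point and not yet ints.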
9. Object Processing
• Object processing involves the creation of a ‘blank’ object and
setting its attributes as a result of input.
• Often done after tokenization of input.
• The end result is an object configured as desired.
• One way to handle persisting objects in files.
• May be repeated.
• Create an array of appropriately configured objects.
10. Data Conversion
• While parsing, we can take the time to convert data into
more appropriate representations.
• After pulling numbers in from a file, they’re usually stored as
strings.
• Can use various conversion functions to clean up representation.
• atoi, as an example (it takes a C string, so pass str.c_str())
• Can convert from rough representations to more precise
representations.
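
A minimal sketch of the conversion step, assuming the tokens have already been split out. The slide mentions atoi; this example swaps in std::stoi because it accepts a std::string directly and throws on bad input instead of silently returning 0. The function name toInts is hypothetical.

```cpp
#include <string>
#include <vector>

// Convert each numeric token to an int with std::stoi (from <string>).
// std::stoi throws std::invalid_argument if a token does not start
// with a number, so bad input is not silently turned into 0.
std::vector<int> toInts(const std::vector<std::string>& tokens) {
    std::vector<int> values;
    for (const std::string& token : tokens) {
        values.push_back(std::stoi(token));
    }
    return values;
}
```

Once the values are ints, they can be summed, compared and sorted numerically, which is not possible while they are strings.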
11. Example
• Consider the following example scenario – calculate the Flesch
Readability index of a document.
• Need to determine:
• Number of sentences
• Number of words
• Number of syllables in words
• Read in as a string from a text file.
• Must be parsed.
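
The slides do not give the formula. Assuming the index meant here is the standard Flesch Reading Ease score, the final calculation can be sketched as below once the three counts are known. The function name is hypothetical.

```cpp
// Flesch Reading Ease: higher scores indicate easier text.
// Assumes sentences > 0 and words > 0; the caller must check this,
// otherwise the divisions below are meaningless.
double fleschReadingEase(int sentences, int words, int syllables) {
    double wordsPerSentence = static_cast<double>(words) / sentences;
    double syllablesPerWord = static_cast<double>(syllables) / words;
    return 206.835 - 1.015 * wordsPerSentence - 84.6 * syllablesPerWord;
}
```

The arithmetic is the easy part. The real work is parsing the document to get the three counts.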
12. The Hard Way
• Can manipulate a string directly.
• Count spaces in a string.
• That gives word count, roughly
• Count full stops in a string
• That gives the number of sentences
• Syllable count?
• Uh…
• Horrors upon horrors
• Must parse to get a structure amenable to processing.
• An array of strings.
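
A rough sketch of the direct approach, to show where it runs out of road. The function names are hypothetical, and both counts are only approximations, as noted in the comments.

```cpp
#include <algorithm>
#include <string>

// Rough word count: one more than the number of spaces.
// Breaks on double spaces, leading/trailing spaces and empty input.
int roughWordCount(const std::string& text) {
    return static_cast<int>(std::count(text.begin(), text.end(), ' ')) + 1;
}

// Rough sentence count: the number of full stops.
// Miscounts abbreviations and ellipses, and ignores '?' and '!'.
int roughSentenceCount(const std::string& text) {
    return static_cast<int>(std::count(text.begin(), text.end(), '.'));
}
```

Neither function helps with syllables, because that needs each word on its own. That is why the string has to be tokenized first.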
13. String Processing
• Strings contain many useful functions for handling such
parsing.
• find function gives the location of a particular character.
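#include <iostream>
#include <string>
using namespace std;
int main() {
    string str = "Hello World";
    // find returns string::npos when there is no match
    string::size_type index = str.find ("e", 0);
    if (index != string::npos) {
        cout << "Found at: " << index << endl;
    }
}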
14. String Processing
• Can use the substr function of a string to extract a substring
from a full string:
#include <iostream>
#include <string>
using namespace std;
int main() {
    string str = "Hello World";
    // substr (start position, number of characters)
    string sub = str.substr (0, 5);
    cout << "Substring is: " << sub << endl;
}
15. Working With Strings
• Strings also contain a very useful length function.
• This tells you how many characters they contain.
• Also possible to index a string just like an array.
• This lets you get individual characters out of a string.
• Can combine these into powerful functions.
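
As one illustration of combining length and indexing, here is a crude syllable estimate for a single word. This heuristic is not from the slides: it is an assumption made for the example, it simply counts runs of consecutive vowels, and it gets many English words wrong. Both function names are hypothetical.

```cpp
#include <string>

// Hypothetical helper: true for the vowels a, e, i, o, u in either case.
bool isVowel(char c) {
    return std::string("aeiouAEIOU").find(c) != std::string::npos;
}

// Crude syllable estimate: count runs of consecutive vowels.
// Illustrative only; real English has many exceptions (silent 'e', 'y', ...).
int estimateSyllables(const std::string& word) {
    int groups = 0;
    bool inGroup = false;
    for (std::string::size_type i = 0; i < word.length(); i++) {
        bool vowel = isVowel(word[i]);
        if (vowel && !inGroup) {
            groups += 1;  // a new run of vowels starts here
        }
        inGroup = vowel;
    }
    return groups;
}
```

For "Hello" this counts two runs ("e" and "o"), and for "World" it counts one.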
16. Tokenization
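#include <iostream>
#include <string>
using namespace std;
int main() {
    const int MAX_TOKENS = 100;
    string arr[MAX_TOKENS];
    int size = 0;
    string str = "Snausages are snausages for snausages";
    string::size_type start = 0;
    // Walk one past the last character so that the end of the string
    // closes the final token without losing its last letter.
    for (string::size_type i = 0; i <= str.length() && size < MAX_TOKENS; i++) {
        if (i < str.length() && str[i] != ' ') {
            continue;
        }
        // str[i] is a space, or we have reached the end of the string
        arr[size] = str.substr (start, i - start);
        start = i + 1;
        size += 1;
    }
    // Print the tokens to check the result
    for (int t = 0; t < size; t++) {
        cout << arr[t] << endl;
    }
}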
17. Tokenization
• There are other ways to tokenize.
• This is just one way to show the power of string manipulation.
• Serves as a basis for more complex data parsing.
• Important to be able to do this – almost every program ends up
parsing its input at some point or another.
18. Object Representation
• Can combine tokenization with object representation.
• Tokenize individual elements.
• Convert them to appropriate data format.
• Use accessor methods on an object to configure.
• Can easily set up large numbers of objects with this kind of
system.
• Combine the objects in an array for the best of both worlds.
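
The steps above can be sketched as follows. The Student class, its fields and the "name,age" record format are all assumptions made for this example, as the slides do not define a class. std::vector is used in place of a fixed-size array.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical class for illustration only.
class Student {
public:
    void setName(const std::string& n) { name = n; }
    void setAge(int a) { age = a; }
    const std::string& getName() const { return name; }
    int getAge() const { return age; }
private:
    std::string name;
    int age = 0;
};

// Parse one record of the assumed form "name,age" into a Student.
Student parseStudent(const std::string& record) {
    std::istringstream stream(record);
    std::string nameField;
    std::string ageField;
    std::getline(stream, nameField, ',');  // tokenize: first field
    std::getline(stream, ageField);        // tokenize: rest of the record
    Student s;                             // the 'blank' object
    s.setName(nameField);
    s.setAge(std::stoi(ageField));         // convert string to int
    return s;
}

// Parse many records into a collection of configured objects.
std::vector<Student> parseStudents(const std::vector<std::string>& records) {
    std::vector<Student> students;
    for (const std::string& record : records) {
        students.push_back(parseStudent(record));
    }
    return students;
}
```

Each record goes through the same three steps: tokenize, convert, configure. Reading one record per line from a file and passing the lines to parseStudents is one way to load saved objects back in.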
19. This Is The End…
• That brings us to the end of the scheduled content for
C++.
• Cheer / Cry as you feel is appropriate.
• Next week, we’ll use the time as consolidation time.
• Thursday lecture will be a formal revision lecture covering all the
topics we have previously met.
• Wednesday lecture/tutorial will be a drop-in revision session. No
planned content, come along with whatever questions you have.
20. Some Final Thoughts
• Programming is hard.
• I did warn you at the start!
• It’s also a very rare and valuable skill
• Which you are well on the way to building.
• It is a skill that requires training.
• Like playing a musical instrument or fighting off ninjas.
• Important not to let it slide.
21. Some Final Thoughts
• It’s worthwhile keeping a notebook of ‘things I wish I had software
to do’.
• It can serve as a basis for further exploration of programming.
• Don’t worry if you don’t know how to do the things.
• Research is a constant part of programming. Nobody knows how to
do everything.
• Stretching yourself by setting tasks you don’t know how to do is a
great way to learn.
• Even if you never complete it, the process is valuable.
22. Summary
• Parsing is an important part of software development.
• It helps you turn unstructured data into structured data.
• Comes in many forms.
• String parsing is the most immediately useful of these.
• Tokenization is a key parsing technique.
• Worth playing about with.