String Parsing
Michael Heron
CPP18 - String Parsing

This is an introductory lecture on C++, suitable for first year computing students or those doing a conversion masters degree at postgraduate level.

  1. String Parsing
     Michael Heron
  2. Introduction
     • Having got a string in your system, how do you manipulate it?
     • Strings are fundamental forms of data representation.
     • Often obtained from text files and user input.
     • Most strings are not in an easily managed form.
     • The process of parsing is used to render raw data into more refined forms.
  3. Parsing
     • There are many reasons why we may wish to parse data.
     • Information comes in as a string, and we want it in an array.
     • Information comes in as lists of numbers stored as strings, and we want them in objects.
     • We are rarely so lucky as to be able to instantly manipulate data that comes into the system.
  4. Data Representation
     • The absolute most important thing in designing a program is to represent your data right.
     • If you get this right, everything is easier as a result.
     • If you get it wrong, everything is more difficult.
     • Before you ever write a line of code, consider how data must be represented in the system.
     • What variables, objects and arrays are you going to use?
  5. Data Representation
     • Consider how you are going to need to manipulate the data in the system.
     • Are you going to need to be able to search through things?
     • Are you going to need to process each value in turn?
     • Are you going to need to represent relationships between things?
     • An easily manipulated data structure is worth its weight in gold.
  6. Parsing
     • Parsing is the process of turning difficult-to-manipulate data into a more useful format.
     • Break strings up into all their constituent parts.
     • Convert from multiple arrays into an array of objects.
     • Important first step before more complex processing.
     • Various standard techniques exist to facilitate this.
  7. Common Parsing Tasks
     • Tokenization
       • Turn a string into several smaller strings (tokens) by splitting on a delimiter.
     • Object processing
       • Breaking multiple data fields out of a single string and configuring an object.
     • Data conversion
       • Bringing data elements into some common format.
     • Often necessary to combine different processes.
  8. Tokenization
     • Tokenization is the process of splitting up strings.
     • Based on the idea of a delimiter.
     • Strings that have a common, delimited structure are amenable to tokenization.
       • 10,20,30,40
       • Jim,Jake,Jane,Johana
     • Strings are broken up based on the delimiter and the result is an array of strings.
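The comma-delimited strings above can be split with a short helper. This is a minimal sketch, not code from the slides: the name `split` is hypothetical, and it returns a `std::vector` rather than a fixed-size array so the caller does not have to guess the token count in advance.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split `text` on `delimiter` and return the pieces in order.
std::vector<std::string> split(const std::string& text, char delimiter) {
    std::vector<std::string> tokens;
    std::istringstream stream(text);
    std::string token;
    // std::getline reads up to the next delimiter and discards the delimiter itself.
    while (std::getline(stream, token, delimiter)) {
        tokens.push_back(token);
    }
    return tokens;
}
```

Calling `split("10,20,30,40", ',')` returns four strings, "10" through "40"; they are still text at this point, not numbers.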
  9. Object Processing
     • Object processing involves the creation of a ‘blank’ object and setting its attributes as a result of input.
     • Often done after tokenization of input.
     • The end result is an object configured as desired.
     • One way to handle persisting objects in files.
     • May be repeated.
       • Create an array of appropriately configured objects.
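As a rough illustration of object processing, the sketch below creates a blank object and configures it from one line of input. The `Student` class, its fields, the `parseStudent` name and the "name,age" input format are all assumptions made up for this example.

```cpp
#include <sstream>
#include <string>

// Hypothetical record type, used only for illustration.
class Student {
public:
    void setName(const std::string& name) { name_ = name; }
    void setAge(int age) { age_ = age; }
    const std::string& getName() const { return name_; }
    int getAge() const { return age_; }

private:
    std::string name_;
    int age_ = 0;
};

// Build a Student from a line in the assumed format "name,age".
Student parseStudent(const std::string& line) {
    Student student;  // the 'blank' object
    std::istringstream fields(line);
    std::string name;
    std::string ageText;
    std::getline(fields, name, ',');     // first field: the name
    std::getline(fields, ageText, ',');  // second field: the age, still a string
    student.setName(name);
    student.setAge(std::stoi(ageText));  // std::stoi throws if ageText is not numeric
    return student;
}
```

For example, `parseStudent("Jane,19")` gives an object whose name is "Jane" and whose age is the integer 19.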
  10. Data Conversion
      • As part of parsing, we can take the time to convert data into more appropriate representations.
      • After pulling numbers in from a file, they’re usually stored as strings.
      • Can use various conversion functions to clean up the representation.
        • atoi, as an example.
      • Can convert from rough representations to more precise representations.
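A small sketch of the conversion step, using `atoi` as the slide suggests. The helper name `toInts` is hypothetical. Note that `atoi` returns 0 when the text cannot be converted, so it cannot tell "0" apart from bad input; `std::stoi` is an alternative that throws an exception instead.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical helper: convert a list of numeric strings into ints.
std::vector<int> toInts(const std::vector<std::string>& tokens) {
    std::vector<int> values;
    for (const std::string& token : tokens) {
        // atoi needs a C-style string, hence c_str().
        values.push_back(std::atoi(token.c_str()));
    }
    return values;
}
```

Once converted, the values can be summed, sorted or compared numerically, which is not meaningful while they are still strings.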
  11. Example
      • Consider the following example scenario: calculate the Flesch Readability index of a document.
      • Need to determine:
        • Number of sentences
        • Number of words
        • Number of syllables in words
      • Read in as a string from a text file.
      • Must be parsed.
  12. The Hard Way
      • Can manipulate a string directly.
      • Count spaces in a string.
        • That gives word count, roughly.
      • Count full stops in a string.
        • That gives the number of sentences.
      • Syllable count?
        • Uh…
        • Horrors upon horrors.
      • Must parse to get a structure amenable to processing.
        • An array of strings.
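The direct approach to the first two counts might look like the sketch below. The helper name `countChar` is an assumption, and the results are only approximate: double spaces, abbreviations and other punctuation will all throw the counts off.

```cpp
#include <string>

// Hypothetical helper: count how many times `target` appears in `text`.
int countChar(const std::string& text, char target) {
    int count = 0;
    for (std::size_t i = 0; i < text.length(); i++) {
        if (text[i] == target) {
            count++;
        }
    }
    return count;
}
```

For "The cat sat. The dog ran." this finds 5 spaces (so roughly 6 words) and 2 full stops (so 2 sentences). Nothing this simple works for syllables, which is why the text needs to be broken into individual words first.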
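  13. String Processing
      • Strings contain many useful functions for handling such parsing.
      • The find function gives the position of a particular character or substring, or string::npos if it is not found.

          #include <iostream>
          #include <string>
          using namespace std;

          int main() {
              string str = "Hello World";
              // find returns a size_t position, or string::npos if there is no match.
              size_t index = str.find("e", 0);
              if (index != string::npos) {
                  cout << "Found at: " << index << endl;
              }
              return 0;
          }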
  14. String Processing
      • Can use the substr function of a string to extract a substring from a full string:

          #include <iostream>
          #include <string>
          using namespace std;

          int main() {
              string str = "Hello World";
              // substr(start, count): take 5 characters starting at position 0.
              string sub = str.substr(0, 5);
              cout << "Substring is: " << sub << endl;
              return 0;
          }
  15. Working With Strings
      • Strings also contain a very useful length function.
      • This tells you how many characters they contain.
      • Also possible to index a string just like an array.
      • This lets you get individual characters out of a string.
      • Can combine these into powerful functions.
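As one example of combining length and indexing, the sketch below makes a crude syllable estimate by counting groups of consecutive vowels in a word. This heuristic is an assumption of this example, not something from the slides, and it is known to be inaccurate: it counts a silent final "e" as a syllable, for instance.

```cpp
#include <cctype>
#include <string>

// Treat a, e, i, o, u and y as vowels, ignoring case.
bool isVowel(char c) {
    char lower = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return lower == 'a' || lower == 'e' || lower == 'i' ||
           lower == 'o' || lower == 'u' || lower == 'y';
}

// Hypothetical heuristic: one syllable per run of consecutive vowels.
int estimateSyllables(const std::string& word) {
    int groups = 0;
    bool previousWasVowel = false;
    for (std::size_t i = 0; i < word.length(); i++) {
        bool vowel = isVowel(word[i]);  // index the string like an array
        if (vowel && !previousWasVowel) {
            groups++;  // a new vowel run starts here
        }
        previousWasVowel = vowel;
    }
    return groups;
}
```

With this, "hello" and "parsing" are each estimated at 2 syllables and "string" at 1, but "are" is also estimated at 2, which shows the limits of the approach.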
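  16. Tokenization

          #include <iostream>
          #include <string>
          using namespace std;

          int main() {
              string arr[100];
              int size = 0;
              string str = "Snausages are snausages for snausages";
              size_t start = 0;

              // Run one position past the end so the final word is captured whole.
              for (size_t i = 0; i <= str.length(); i++) {
                  if (i < str.length() && str[i] != ' ') {
                      continue;
                  }
                  // i is at a space or at the end of the string: a token ends here.
                  if (size < 100) {
                      arr[size] = str.substr(start, i - start);
                      size += 1;
                  }
                  start = i + 1;
              }

              // Print the tokens to check the result.
              for (int j = 0; j < size; j++) {
                  cout << arr[j] << endl;
              }
              return 0;
          }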
  17. Tokenization
      • There are other ways to tokenize.
      • This is just one way to show the power of string manipulation.
      • Serves as a basis for more complex data parsing.
      • Important to be able to do this: all program representation breaks down into parsing at some point or another.
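One of those other ways, sketched here as an alternative rather than as the approach the lecture uses, is to let a string stream do the splitting. The `>>` operator skips any amount of whitespace between words, so repeated spaces do not produce empty tokens. The helper name `splitOnWhitespace` is hypothetical.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split `text` into whitespace-separated words.
std::vector<std::string> splitOnWhitespace(const std::string& text) {
    std::vector<std::string> words;
    std::istringstream stream(text);
    std::string word;
    // Each >> extraction skips leading whitespace and reads one word.
    while (stream >> word) {
        words.push_back(word);
    }
    return words;
}
```

For "Snausages are snausages for snausages" this returns five words. The trade-off is that it only splits on whitespace, so comma-delimited data still needs a delimiter-based approach.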
  18. Object Representation
      • Can combine tokenization with object representation.
      • Tokenize individual elements.
      • Convert them to an appropriate data format.
      • Use accessor methods on an object to configure it.
      • Can easily set up large numbers of objects with this kind of system.
      • Combine the objects in an array for the best of both worlds.
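Putting the pieces together might look like the sketch below: tokenize each record, convert the fields, configure an object through its accessor methods, and collect the objects. The `Person` class, the `parsePeople` name and the one-record-per-line "name,age" format are assumptions for illustration, and a `std::vector` stands in for the array.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical class, used only for illustration.
class Person {
public:
    void setName(const std::string& name) { name_ = name; }
    void setAge(int age) { age_ = age; }
    const std::string& getName() const { return name_; }
    int getAge() const { return age_; }

private:
    std::string name_;
    int age_ = 0;
};

// Parse text holding one "name,age" record per line into a list of Person objects.
std::vector<Person> parsePeople(const std::string& text) {
    std::vector<Person> people;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string name;
        std::string ageText;
        // Skip any line that does not have both fields.
        if (!std::getline(fields, name, ',') || !std::getline(fields, ageText, ',')) {
            continue;
        }
        Person person;
        person.setName(name);
        person.setAge(std::stoi(ageText));  // throws if the age is not numeric
        people.push_back(person);
    }
    return people;
}
```

Given "Jim,20\nJake,31\nJane,19" this produces three configured objects that can then be searched or processed in turn.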
  19. This Is The End…
      • That brings us to the end of the scheduled content for C++.
      • Cheer / Cry as you feel is appropriate.
      • Next week, we’ll use the time as consolidation time.
      • Thursday lecture will be a formal revision lecture covering all the topics we have previously met.
      • Wednesday lecture/tutorial will be a drop-in revision session. No planned content, come along with whatever questions you have.
  20. Some Final Thoughts
      • Programming is hard.
        • I did warn you at the start!
      • It’s also a very rare and valuable skill.
        • Which you are moving towards properly building.
      • It is a skill that requires training.
        • Like playing a musical instrument or fighting off ninjas.
      • Important not to let it slide.
  21. Some Final Thoughts
      • It’s worthwhile keeping a notebook of ‘things I wish I had software to do’.
      • It can serve as a basis for further exploration of programming.
      • Don’t worry if you don’t know how to do the things.
      • Research is a constant part of programming. Nobody knows how to do everything.
      • Stretching yourself by setting tasks you don’t know how to do is a great way to learn.
      • Even if you never complete it, the process is valuable.
  22. Summary
      • Parsing is an important part of software development.
      • It helps you turn unstructured data into structured data.
      • Comes in many forms.
      • String parsing is the most immediately useful of these.
      • Tokenization is a key parsing technique.
      • Worth playing about with.