CPP18 - String Parsing

This is an introductory lecture on C++, suitable for first year computing students or those doing a conversion masters degree at postgraduate level.

Usage Rights

© All Rights Reserved


CPP18 - String Parsing: Presentation Transcript

  • String Parsing Michael Heron
  • Introduction • Having got a string in your system, how do you manipulate it? • Strings are fundamental forms of data representation. • Often obtained from text files and user input. • Most strings are not in an easily managed form. • The process of parsing is used to render raw data into more refined forms.
  • Parsing • There are many reasons why we may wish to parse data. • Information comes in as a string – we want it in an array. • Information comes in as lists of numbers stored as strings – we want them in objects. • We are rarely so lucky as to be able to instantly manipulate data that comes into the system.
  • Data Representation • The absolute most important thing in designing a program is to represent your data right. • If you get this right, everything is easier as a result. • If you get it wrong, everything is more difficult. • Before you ever write a line of code, consider how data must be represented in the system. • What variables, objects and arrays are you going to use?
  • Data Representation • Consider how you are going to need to manipulate the data in the system. • Are you going to need to be able to search through things? • Are you going to need to process each value in turn? • Are you going to need to represent relationships between things? • An easily manipulated data structure is worth its weight in gold.
  • Parsing • Parsing is the process of turning difficult-to-manipulate data into a more useful format. • Break strings up into all their constituent parts. • Convert from multiple arrays into an array of objects. • Important first step before more complex processing. • Various standard techniques exist to facilitate this.
  • Common Parsing Tasks • Tokenization • Turn a string into several smaller strings (tokens) by splitting on a delimiter. • Object processing • Breaking multiple data fields out of a single string and configuring an object. • Data conversion • Bringing data elements into some common format. • Often necessary to combine different processes.
  • Tokenization • Tokenization is the process of splitting up strings. • Based on the idea of a delimiter. • Strings that have a common, delimited structure are amenable to tokenization. • 10,20,30,40 • Jim,Jake,Jane,Johana • Strings are broken up based on the delimiter and the result is an array of strings.
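As a minimal sketch of delimiter-based tokenization, the function below splits a string such as 10,20,30,40 on commas. The name splitOnDelimiter is a hypothetical helper, and returning a std::vector rather than a fixed-size array is an assumption made here for convenience, not something the slides prescribe.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split `input` on `delimiter` and return the pieces in order.
// splitOnDelimiter is a hypothetical helper name, not from the lecture.
std::vector<std::string> splitOnDelimiter(const std::string& input, char delimiter) {
    std::vector<std::string> tokens;
    std::istringstream stream(input);
    std::string token;
    // getline reads up to (and discards) the next delimiter each time round.
    while (std::getline(stream, token, delimiter)) {
        tokens.push_back(token);
    }
    return tokens;
}
```

Calling splitOnDelimiter("Jim,Jake,Jane,Johana", ',') should give four strings, one per name.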
  • Object Processing • Object processing involves the creation of a ‘blank’ object and setting its attributes as a result of input. • Often done after tokenization of input. • The end result is an object configured as desired. • One way to handle persisting objects in files. • May be repeated. • Create an array of appropriately configured objects.
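One way to picture object processing is sketched below: a 'blank' object is created and its attributes are filled in from a single input record. The Person type, its fields, and the "name,age" record layout are all assumptions invented for illustration; the lecture does not define them.

```cpp
#include <sstream>
#include <string>

// Hypothetical record type, invented for this sketch.
struct Person {
    std::string name;
    int age = 0;
};

// Build a Person from a record such as "Jane,19".
// Assumes the age field, if present, holds a valid integer;
// std::stoi throws an exception otherwise.
Person parsePerson(const std::string& record) {
    Person p;  // start with a 'blank' object
    std::istringstream fields(record);
    std::string field;
    if (std::getline(fields, field, ',')) {
        p.name = field;
    }
    if (std::getline(fields, field, ',')) {
        p.age = std::stoi(field);
    }
    return p;
}
```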
  • Data Conversion • As a result of parsing, can take the time to convert data into more appropriate representations. • After pulling numbers in from a file, they’re usually stored as strings. • Can use various conversion functions to clean up representation. • atoi, as an example • Can convert from rough representations to more precise representations.
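A small sketch of the conversion step, using atoi as named on the slide: each numeric string is turned into an int. The helper name toInts and the use of std::vector are assumptions made for this example.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Convert each numeric string to an int with atoi.
// atoi returns 0 for text it cannot convert, so bad input
// is not detected in this sketch.
std::vector<int> toInts(const std::vector<std::string>& fields) {
    std::vector<int> numbers;
    for (const std::string& field : fields) {
        numbers.push_back(std::atoi(field.c_str()));
    }
    return numbers;
}
```

Where detecting bad input matters, std::stoi is an alternative that throws an exception instead of silently returning 0.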
  • Example • Consider the following example scenario – calculate the Flesch Readability index of a document. • Need to determine: • Number of sentences • Number of words • Number of syllables in words • Read in as a string from a text file. • Must be parsed.
  • The Hard Way • Can manipulate a string directly. • Count spaces in a string. • That gives word count, roughly • Count full stops in a string • That gives the number of sentences • Syllable count? • Uh… • Horrors upon horrors • Must parse to get a structure amenable to processing. • An array of strings.
  • String Processing • Strings contain many useful functions for handling such parsing. • The find function gives the location of a particular character.

        #include <iostream>
        #include <string>
        using namespace std;

        int main() {
            string str = "Hello World";
            // find returns the index of the first match,
            // or string::npos if there is no match.
            string::size_type index = str.find("e", 0);
            if (index != string::npos) {
                cout << "Found at: " << index << endl;
            }
            return 0;
        }

  • String Processing • Can use the substr function of a string to extract a substring from a full string:

        #include <iostream>
        #include <string>
        using namespace std;

        int main() {
            string str = "Hello World";
            // substr takes a start position and a length:
            // here, 5 characters starting at index 0.
            string sub = str.substr(0, 5);
            cout << "Substring is: " << sub << endl;
            return 0;
        }

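The two functions are often used together: find locates a delimiter and substr extracts what comes before it. The sketch below shows this under the assumption of a hypothetical helper named firstField.

```cpp
#include <string>

// Return the text before the first occurrence of `delimiter`,
// or the whole string if the delimiter is not present.
std::string firstField(const std::string& text, char delimiter) {
    std::string::size_type pos = text.find(delimiter);
    if (pos == std::string::npos) {
        return text;
    }
    return text.substr(0, pos);
}
```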
  • Working With Strings • Strings also contain a very useful length function. • This tells you how many characters they contain. • Also possible to index a string just like an array. • This lets you get individual characters out of a string. • Can combine these into powerful functions.
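As an illustration of combining length with indexing, the sketch below visits each character in turn and counts the vowels. The function name countVowels is an assumption made for this example.

```cpp
#include <cstddef>
#include <string>

// Count the vowels in a string by visiting each character by index.
int countVowels(const std::string& text) {
    const std::string vowels = "aeiouAEIOU";
    int count = 0;
    for (std::size_t i = 0; i < text.length(); i++) {
        // text[i] is the single character at position i.
        if (vowels.find(text[i]) != std::string::npos) {
            count++;
        }
    }
    return count;
}
```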
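  • Tokenization

        #include <iostream>
        #include <string>
        using namespace std;

        int main() {
            string arr[100];
            int size = 0;
            string str = "Snausages are snausages for snausages";
            string::size_type start = 0;

            // Walk one position past the end so the final word is captured in full.
            for (string::size_type i = 0; i <= str.length(); i++) {
                if (i < str.length() && str[i] != ' ') {
                    continue;
                }
                // A space, or the end of the string, closes the current token.
                arr[size] = str.substr(start, i - start);
                start = i + 1;
                size += 1;
            }

            // Print one token per line.
            for (int i = 0; i < size; i++) {
                cout << arr[i] << endl;
            }
            return 0;
        }
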
  • Tokenization • There are other ways to tokenize. • This is just one way to show the power of string manipulation. • Serves as a basis for more complex data parsing. • Important to be able to do this – all program representation breaks down into parsing at some point or another.
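One of the other ways to tokenize on whitespace is to let a string stream do the splitting. This is a sketch only; the helper name splitOnWhitespace and the std::vector return type are assumptions made here.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split `input` into words separated by any run of whitespace.
std::vector<std::string> splitOnWhitespace(const std::string& input) {
    std::vector<std::string> words;
    std::istringstream stream(input);
    std::string word;
    // operator>> skips leading whitespace and reads one word at a time.
    while (stream >> word) {
        words.push_back(word);
    }
    return words;
}
```

Unlike a character-by-character loop, this version ignores repeated spaces, so it never produces empty tokens.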
  • Object Representation • Can combine tokenization with object representation. • Tokenize individual elements. • Convert them to appropriate data format. • Use accessor methods on an object to configure. • Can easily set up large numbers of objects with this kind of system. • Combine the objects in an array for the best of both worlds.
  • This Is The End… • With that, it brings us to the end of the scheduled content for C++. • Cheer / Cry as you feel is appropriate. • Next week, we’ll use the time as consolidation time. • Thursday lecture will be a formal revision lecture covering all the topics we have previously met. • Wednesday lecture/tutorial will be a drop-in revision session. No planned content, come along with whatever questions you have.
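The steps on this slide might be combined as in the sketch below: each line of input is tokenized, the numeric field is converted, and accessor methods configure an object that is then stored in a collection. The Score class, its methods, and the one "name,points" record per line layout are hypothetical, chosen only to illustrate the idea.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical class configured through accessor methods.
class Score {
public:
    void setName(const std::string& n) { name = n; }
    void setPoints(int p) { points = p; }
    const std::string& getName() const { return name; }
    int getPoints() const { return points; }

private:
    std::string name;
    int points = 0;
};

// Parse text holding one "name,points" record per line.
// Lines without both fields are skipped; std::stoi throws
// if the points field is not a valid integer.
std::vector<Score> parseScores(const std::string& data) {
    std::vector<Score> scores;
    std::istringstream lines(data);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream fields(line);
        std::string name;
        std::string points;
        if (std::getline(fields, name, ',') && std::getline(fields, points)) {
            Score s;
            s.setName(name);
            s.setPoints(std::stoi(points));  // convert text to a number
            scores.push_back(s);
        }
    }
    return scores;
}
```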
  • Some Final Thoughts • Programming is hard. • I did warn you at the start! • It’s also a very rare and valuable skill • Which you are moving towards properly building. • It is a skill that requires training. • Like playing a musical instrument or fighting off ninjas. • Important not to let it slide.
  • Some Final Thoughts • It’s worthwhile keeping a notebook of ‘things I wish I had software to do’. • It can serve as a basis for further exploration of programming. • Don’t worry if you don’t know how to do the things. • Research is a constant part of programming. Nobody knows how to do everything. • Stretching yourself by setting tasks you don’t know how to do is a great way to learn. • Even if you never complete it, the process is valuable.
  • Summary • Parsing is an important part of software development. • It helps you turn unstructured data into structured data. • Comes in many forms. • String parsing is the most immediately useful of these. • Tokenization is a key parsing technique. • Worth playing about with.