I’m going to assume that I don’t have to convince you that there’s a lot more Volume now. I could say “since you got up this morning, more data has been created than all the data created from the time the first cuneiform writing was invented up through some surprisingly recent historical event,” but we’ve all been hearing those stories a lot lately.
A related issue is Velocity. One of the reasons that there’s a greater volume is that more devices are generating data, and some of them very quickly because it’s cheaper to do so. Sensors to measure how much liquid is going through a pipe or whether a window is open are less expensive to make, so people are making them and having them send data.
OPTIONAL: The classic example is a modern smartphone, which besides measuring your geolocation can also record things like the angle that you’re holding it at, not to mention the things you’re doing on the phone. When I install an app on my phone that doesn’t need permission to read or write any special data, it’s always a pleasant surprise, because the default is that so many of them do. Industrial processes and an increasing number of household devices are taking greater advantage of inexpensive devices that can record things and then pass along what they record, and because the computation and transmission is cheap, they can do it a lot, so they do.
Variety: people want to learn things by combining different kinds of data and looking for patterns. With big data efforts, people often want to combine two data sets that have only one or two fields in common, and then they can use those fields as connections to look for interesting patterns, but forging those connections is not typically very easy. I’m going to talk more about this shortly, because the Variety V is really the focus of my talk.
The research firm formerly known as “The Gartner Group”
These are classic old-fashioned data integration problems, but they’re an issue with big data projects because people want to integrate more databases more often, sometimes just temporarily to see if anything interesting results.
1.3. Efficiency of development (see 1.2.) and execution, because you can create indexes based on schemas.
2.1 If I want to add a formerEmployer property to note that someone used to work at one of our customers…
2.3. The SQL standard does specify a way to list a database’s tables, but Oracle and DB2 don’t follow it, and have their own way. http://troels.arvin.dk/db/rdbms/
3. Many popular NoSQL database managers offer some schema-like features, like MongoDB’s data models and Neo4J’s constraints, but these are obviously very implementation-specific.
1.3 the NoSQL database is typically assembled to play a specific role in a specific application, as opposed to providing a general-purpose database.
We’ve seen some advantages of using schemas and some advantages of not using them.
The choice has often been this: are you going to have a description of every single database field, or are you going to go with no description of any of them?
This is a tiny example to fit on the slide. What if I have 12 databases with a hundred properties each? What if I want the advantages that we saw of schemas, but I’m only interested in a combination of 8 fields from one database, 12 from another, and 2 from another? Do I have to choose between using the 12 entire schemas or no schemas at all? How can I use schemas as metadata to drive my use of the specific subset of data that I’m interested in?
ETL? We can move this intelligence into program code, but code is less re-usable than schema metadata, and it also doesn’t age well. It’s a lot easier to picture twenty-year-old data or metadata being useful today than twenty-year-old code. Plus, you’re copying data and changing it (transforming it) along the way, which introduces the possibility of errors, and you have to plan around the likely possibility of the copy becoming out of date.
4. Often associated with Semantic Web or Linked Data technologies. I’m happy to talk about those, but I’m not here to talk about them today. I’m here to talk about how RDFS (and if you like, a little OWL and the associated RDF query language SPARQL) can make it easier to flexibly deal with a variety of data.
(After describing slide)
We haven’t even gotten to the RDFS standard yet, and are just using standard parts of RDF.
So far, so what? We’ve listed the properties that we’re interested in, in a machine-readable standardized way. For one thing, I can look at this and it can guide me in the writing of a query, because I see what the available properties are. Even better, a program that’s going to generate a form—for example, a search form for this data—can read this schema and generate just such a form.
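As a sketch of that last point, here is a schema listed as RDF-style (subject, predicate, object) triples and a program that reads it to drive form generation. This is illustrative stdlib Python, not tied to any particular RDF toolkit, and the property names and prefixes are made up for the example.

```python
# Hypothetical schema: two properties declared and labeled, expressed as
# (subject, predicate, object) triples with made-up "ab:" names.
SCHEMA = [
    ("ab:surname",    "rdf:type",   "rdf:Property"),
    ("ab:surname",    "rdfs:label", "Last name"),
    ("ab:postalCode", "rdf:type",   "rdf:Property"),
    ("ab:postalCode", "rdfs:label", "Postal code"),
]

def form_fields(schema):
    """Return (property, human-readable label) pairs for every declared
    property, e.g. to generate the fields of a search form."""
    props = {s for s, p, o in schema
             if p == "rdf:type" and o == "rdf:Property"}
    labels = {s: o for s, p, o in schema if p == "rdfs:label"}
    return [(prop, labels.get(prop, prop)) for prop in sorted(props)]
```

A form generator that only understands this little schema vocabulary can then render a labeled search field for each property it finds, with no hardwired knowledge of any particular data source.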
But let’s look at some more interesting things we can do.
There are ways in RDFS to assert that we want to treat surname in one database the same as last name in the other, but it’s even better to relate them both to a common property: a standard one if available (here you can see that I’ve used properties from schema.org), or one that you make up for this purpose.
Here we have implemented a simple little bit of data integration to deal with the variety of names in the different data sources.
I can search and use the data using these property names (on the right) and it will actually use the data from these property names (on the left).
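The subproperty inference behind this can be sketched in plain Python. The idea is that surname in one source and lastName in another are both declared rdfs:subPropertyOf the shared schema.org familyName property, so a query on familyName also matches data recorded with either source-specific property. The prefixes and identifiers here (crm:, conf:, the record ids) are invented for the example; a real RDFS-aware store does this inference for you.

```python
# Hypothetical subproperty declarations: both source-specific properties
# roll up to the common schema.org property.
SUBPROPERTY_OF = {
    "crm:surname":   "schema:familyName",
    "conf:lastName": "schema:familyName",
}

# Data from two different sources, each using its own property name.
DATA = [
    ("crm:cust3",  "crm:surname",   "Ellis"),
    ("conf:att12", "conf:lastName", "Osaka"),
]

def query(data, wanted_property):
    """Match triples whose predicate is wanted_property, or is declared
    a subproperty of it."""
    return [(s, o) for s, p, o in data
            if p == wanted_property
            or SUBPROPERTY_OF.get(p) == wanted_property]
```

Querying on the common property pulls in records from both sources, which is the simple data integration that the slide describes.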
With most NoSQL applications that I know of, “querying” data means writing code in a scripting language. Some of the tools have their own special query languages, but SPARQL is a standard, and a well-implemented one.
SPARQL is for querying RDF triples, and our original data was not in triples. How can we query it with SPARQL?
(After last build)
DON’T BOTHER WITH THIS: To actually act on the schema metadata—that is, to have the application know that it should treat the customer surnames and the conference attendee last names as schema.org family names—requires an inferencing step; there are plenty of commercial and open source tools that can do that. It can even be done with SPARQL queries. The important thing is, it’s all done with documented standards that have implementations and traction.
I’m going to take this little data integration schema that I’ve been developing and enhance it even more by just adding a few more statements.
Remember, schema:postalCode stands in for a full URI.
rdfs:domain statements can be used by an application generating a report or an editing form.
So far we’ve seen that RDFS gives us ways to list properties and classes and to say things about them in a machine-readable way so that applications can use that data. OWL lets us say more things.
This shows some of the triples that a program like D2R might generate from these tables.
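A hedged sketch of what mapping middleware such as D2R does: each relational row becomes a set of triples whose subject identifies the row and whose predicates are derived from the column names. The table name, column names, and the exact URI scheme here are made up for illustration; real tools have their own mapping conventions.

```python
def row_to_triples(table, pk_column, row):
    """Turn one relational row (a dict of column -> value) into triples.
    The primary key value identifies the row; every other column becomes
    a predicate."""
    subject = f"{table}/{row[pk_column]}"
    return [(subject, f"{table}#{col}", val)
            for col, val in row.items() if col != pk_column]

# A hypothetical equipment table row.
triples = row_to_triples("equipment", "id",
                         {"id": "e7", "room": "214", "name": "3D printer"})
```

Once the rows are exposed as triples like this, the same SPARQL queries and schema metadata apply to them as to any other RDF data.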
There wasn’t a “located in” property in schema.org, so I declared one myself.
Read through triples, pointing back at tables. “But if locatedIn encompasses both the room and building properties, and locatedIn is transitive, I can just query on locatedIn values to find out what building that piece of equipment is in…”
…with a very simple query. I don’t have to specify any joins or look up foreign keys or anything.
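The transitive reasoning can be sketched in stdlib Python: if the printer is locatedIn a room and the room is locatedIn a building, declaring locatedIn an owl:TransitiveProperty lets a query on locatedIn reach the building directly. The entity names below are invented for the example; an OWL-aware store computes this closure for you.

```python
# Hypothetical asserted locatedIn facts.
LOCATED_IN = [
    ("printer7", "room214"),
    ("room214",  "buildingC"),
]

def located_in_closure(pairs, start):
    """All places reachable from `start` by following locatedIn
    transitively, i.e. the closure an OWL reasoner would infer."""
    found = set()
    frontier = {start}
    while frontier:
        step = {b for a, b in pairs if a in frontier} - found
        found |= step
        frontier = step
    return found
```

So a query for the printer’s locatedIn values returns both the room and the building, with no joins or foreign key lookups spelled out in the query.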
2.1. we saw some RDFS ones like domain and range; OWL gives you new ones like sameAs from the previous slide
2.2. For example, owl:TransitiveProperty is a class, and I said that locatedIn…
2.3. I could define a class called NewCustomers as the set of all customers whose first purchase was in the last 90 days, then use that class to drive decisions about which customers get which communications from the company.
This last category is where OWL can be particularly powerful, but also somewhat intimidating. There’s a lot that you can get out of the first two categories.
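The NewCustomers example from 2.3 can be sketched in plain Python: a class defined by a condition on the data, whose members an OWL reasoner would classify automatically. The class name, the 90-day window, and the sample customers are all hypothetical.

```python
from datetime import date, timedelta

def new_customers(customers, today, window_days=90):
    """Return the ids of customers whose first purchase falls within the
    last `window_days` days -- the membership condition of the
    hypothetical NewCustomers class."""
    cutoff = today - timedelta(days=window_days)
    return [cid for cid, first_purchase in customers.items()
            if first_purchase >= cutoff]
```

The point of doing this declaratively in OWL rather than in application code is that the class definition is re-usable metadata: any application that understands the standard can use it to drive decisions like which customers get which communications.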
Returning to this slide to emphasize that while mapping middleware can generate a lot of schema metadata for you, the ability to add more metadata about the fields you’re interested in, and only those fields, is very powerful. (build)
The metadata lets you tie it all together, or just tie the bits you’re interested in together, using a documented standard with a wide choice of implementations. This is the real key to handling the Variety.
1. A way to say “That data may have been created for one particular application or another, but here’s what I need it for.”
2. If I describe my products for sale using the GoodRelations schema, I can more easily combine my product data with product data from other companies and automate how I sell it using a website or app.
3. One example is the way that an earlier slide said that the surname property from the customer database was a subproperty of the family name property from schema.org…
4. …and that lets me (read bullet), which is ultimately what my presentation here is about.
2. Bottom up was not necessarily an option 15 years ago. You planned a whole system at a high level and then filled in all the details before you could do any development or take advantage of the model. 3.1 Does one data source have 25 tables with dozens of columns in each? Pick the ones that you need for your application and model those.
You don’t have to start with weeks of planning. You can start prototyping at a small scale and build organically from there.
1. Research data, clinical trials, standardized and internal taxonomies, 2. combine sets of production, exploration, and environmental data 3. Looking for new income sources outside of printed books—combining content in different forms from different subsidiaries with different CMSs and other systems and, in the education market that is particularly important to them, lining it up with standards
In a talk like this, it’s more traditional to tell you about the company at the beginning of the talk, but I wanted to wait until the end because you have more context. When I joined the company…
Because of the nature of this conference, and track, I’ve gone into some more of the geeky details about how the standards work and make this kind of integration possible. TopBraid Insight provides a front end that takes advantage of these capabilities of the standards but keeps the geekier details under the hood so that business users can take advantage of them with an intuitive interface.
We have a webinar online
Before I finish I wanted to be a good web citizen and credit the pictures I used on my second slide…
I’d like to finish with this quote from Heraclitus, who lived in the sixth century BC, because it so nicely sums up how if we connect up things that are seemingly unconnected, we can end up with some great new possibilities.
Semantic Web Standards and the Variety “V” of Big Data