Finding and Making Data Prof. Alvarado MDST 3705 12 February 2013
Business
• Quizzes by Friday
• Safari Resources
  – When off grounds, use VPN or access from the Library web page
  – It should allow you to log on to the resource
Big Data
• What is Big Data?
  – Data produced by governments, corporations, scientific instruments, transactions …
  – Captured by databases
• Databases are at the foundation of almost all digital products we use
  – Social Media, from Facebook to WordPress
  – Learning Management Systems (e.g. Collab)
  – Video Games and Simulations
  – Maps and Timelines
The Digital Humanities has entered the era of Big Data.
Numerous collections of primary and secondary sources have been digitized over the last two decades.
To do scholarship, you need to both produce and consume data.
Databases
• We can also use relational databases to ingest data sets from the wild
• Once they are in the database, we may modify them to conform to our own data model
• And we may combine them to produce new data
• The database becomes a recombinant space for creating data mash ups
The database is also a machine for making inferences …
This query is an example of how two tables can be "joined" into a third table. It also shows how you can manipulate the data on the fly to produce new results.
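A minimal sketch of a query of this kind, using Python's built-in sqlite3 module in place of MySQL so that it runs without a server (an assumption, not the course setup). The country_debt table name appears in the Quick Note slide; the country_socialnetwork table, every column name, and every value here are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for a MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE country_debt (Country TEXT, Debt REAL, Population INTEGER);
CREATE TABLE country_socialnetwork (Country TEXT, Network TEXT);
-- Hypothetical rows; the countries and figures are invented.
INSERT INTO country_debt VALUES ('Atlantis', 500.0, 10), ('Utopia', 90.0, 30);
INSERT INTO country_socialnetwork VALUES ('Atlantis', 'Facebook'), ('Utopia', 'Twitter');
""")

# Join the two tables on Country and compute a new value on the fly.
rows = conn.execute("""
SELECT d.Country, s.Network, d.Debt / d.Population AS DebtPerCapita
FROM country_debt AS d
JOIN country_socialnetwork AS s ON s.Country = d.Country
ORDER BY d.Country
""").fetchall()

for row in rows:
    print(row)
```

The result set behaves like a third table: each row combines one row from each source table, plus a DebtPerCapita column that exists in neither.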
Quick Note
• MySQL uses two kinds of quotes
  – Double and single to wrap strings
  – “Backticks” ( ` ) are used sometimes to wrap table and field names
  – E.g. SELECT `Country` FROM `country_debt`
• Backticks are used to allow spaces in field and table names
  – But this is a bad practice; I do not encourage spaces
  – Therefore backticks are optional
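A small sketch of these quoting rules. It uses Python's sqlite3 module rather than MySQL (an assumption, so the example runs without a server); SQLite accepts backtick-quoted identifiers for MySQL compatibility. The row contents and the `Debt Amount` column are hypothetical. Strings are wrapped in single quotes only, because double-quote handling differs between database engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A column name containing a space must be wrapped in backticks every time.
conn.execute("CREATE TABLE `country_debt` (`Country` TEXT, `Debt Amount` REAL)")
conn.execute("INSERT INTO country_debt VALUES ('Atlantis', 500.0)")

# With no spaces in the names, backticks are optional: these two are equivalent.
with_ticks = conn.execute("SELECT `Country` FROM `country_debt`").fetchall()
without_ticks = conn.execute("SELECT Country FROM country_debt").fetchall()

# Single quotes wrap the string value; backticks wrap the awkward column name.
debt = conn.execute(
    "SELECT `Debt Amount` FROM country_debt WHERE Country = 'Atlantis'"
).fetchall()

print(with_ticks, without_ticks, debt)
```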
Just as we saw with Aristotle’s logic, relational databases allow us to develop ontologies from which we can draw inferences
We can see that each table we imported actually stands for an assertion
(The conclusion in this case is simply a correlation)
I felt like the strategy for database design explained in the reading on SQL ran quite contrary to my understanding of the “hacker” mentality, and I think it speaks to the lack of flexibility in the SQL database system. . . . Database designers [are] encouraged to map everything out before even thinking about beginning construction on the actual database.

This is true – the book does project a planning ethos at odds with the spirit of hacking and iterative building. This is as it should be – experienced programmers and database designers do value planning. But building databases can be organic and creative too, especially when the domain being modeled is not well understood, which is often the case with the digital humanities.
Remember that in the digital humanities, we are reverse engineering culture from media.
Instead of planning a data model, we need to extract and evolve one.
But we can use the tools of database design to help us.
Making data is more than adding data to a database.
You first have to create the database.
All good databases are based on models, which we view as knowledge representations.
Learning MySQL
• Provides the right level of information
  – But follows traditional planning model
  – Our approach is a bit different
  – Introduces useful vocabulary
• Key idea in Chapter 3 is use of Entity Relationship Diagrams
  – E-R diagrams
  – I use a simplified version
Database Design
• Process 1 (Planned)
  – Gather requirements
  – Create an ER model (data model)
  – Translate into tables (database schema)
• Process 2 (Evolved)
  – Gather data
  – Find implicit relations
  – Create new tables
  – Create ER model
  – Translate into tables
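The first steps of Process 2 can be sketched in code: start from a flat table as it might arrive from the wild, notice that one column repeats, and pull it out into its own table. This is only an illustration, using Python's sqlite3 as a stand-in for MySQL; the raw_import table, its columns, and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Gather data: a flat table, imported as-is (invented rows).
CREATE TABLE raw_import (Country TEXT, Year INTEGER, Debt REAL);
INSERT INTO raw_import VALUES
  ('Atlantis', 2010, 400.0), ('Atlantis', 2011, 500.0), ('Utopia', 2011, 90.0);

-- Find implicit relations: Country repeats across rows, so treat it as an entity.
-- Create new tables: one for the entity, one for the facts that vary by year.
CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
INSERT INTO country (name)
  SELECT DISTINCT Country FROM raw_import ORDER BY Country;

CREATE TABLE debt_of_country (
  country_id INTEGER REFERENCES country(id),
  year INTEGER,
  amount REAL
);
INSERT INTO debt_of_country
  SELECT c.id, r.Year, r.Debt
  FROM raw_import AS r JOIN country AS c ON c.name = r.Country;
""")

countries = conn.execute("SELECT name FROM country ORDER BY name").fetchall()
debt_rows = conn.execute("SELECT COUNT(*) FROM debt_of_country").fetchone()[0]
print(countries, debt_rows)
```

The ER model is then drawn from the tables that emerged, rather than the other way around.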
The simplest case of two entities with a relationship. We don't specify the nature of the relationship at this point. For example, A might stand for PERSON and B might stand for BOOK, as in PERSON READS BOOK.
This includes the cardinality of the relationship. A relates to 1 or more (or 0 or more) of B. For example, PERSON READS MANY BOOKS.
This shows a Many-to-Many relationship (M:M, or M:N). MANY PERSONS READ MANY BOOKS. That is, a given PERSON may read more than one BOOK, and a given BOOK may be read by more than one PERSON.
This implies the creation of a third entity, C, to capture the BOOK / PERSON relationship. We can think of this as a kind of EVENT: our database will capture all instances, say, of PEOPLE reading BOOKS.
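A minimal sketch of the third entity as a table, using Python's sqlite3 in place of MySQL (an assumption). The table names person, book, and reading, their columns, and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE book (id INTEGER PRIMARY KEY, title TEXT);
-- Entity C: one row per reading EVENT, pointing at one PERSON and one BOOK.
CREATE TABLE reading (
  person_id INTEGER REFERENCES person(id),
  book_id INTEGER REFERENCES book(id)
);
INSERT INTO person VALUES (1, 'Ann'), (2, 'Ben');
INSERT INTO book VALUES (1, 'Organon'), (2, 'Poetics');
INSERT INTO reading VALUES (1, 1), (1, 2), (2, 2);
""")

# One PERSON, many BOOKS: everything Ann (id 1) has read.
books_for_ann = [title for (title,) in conn.execute("""
    SELECT b.title FROM reading AS r
    JOIN book AS b ON b.id = r.book_id
    WHERE r.person_id = 1 ORDER BY b.title
""")]

# One BOOK, many PERSONS: everyone who has read Poetics (id 2).
readers_of_poetics = [name for (name,) in conn.execute("""
    SELECT p.name FROM reading AS r
    JOIN person AS p ON p.id = r.person_id
    WHERE r.book_id = 2 ORDER BY p.name
""")]

print(books_for_ann, readers_of_poetics)
```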
Now, in the case of our two tables, we have the following implied model. (The single arrow heads imply a Subject/Object relation.)
After thinking about this model some, we can see that COUNTRY actually has a 1:M relationship to DEBT, since the latter varies by year. (We can imagine a DEBT table with an AMOUNT field and a YEAR field.) We also know that each SOCIALNETWORK can be related to more than one COUNTRY.
In the end, our model will look something like this. So we will need to create tables to match these entities, e.g. COUNTRY, DEBT_OF_COUNTRY, SOCIALNETWORK, SOCIALNETWORK_OF_COUNTRY.
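One way these four tables might be declared, sketched with Python's sqlite3 instead of MySQL (an assumption). The table names and the YEAR and AMOUNT fields come from the slides; the id columns, foreign key names, and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE COUNTRY (id INTEGER PRIMARY KEY, name TEXT);
-- 1:M: each COUNTRY has many yearly debt rows, so this table points back at it.
CREATE TABLE DEBT_OF_COUNTRY (
  country_id INTEGER REFERENCES COUNTRY(id), year INTEGER, amount REAL
);
CREATE TABLE SOCIALNETWORK (id INTEGER PRIMARY KEY, name TEXT);
-- M:M: a third table holding a pair of foreign keys.
CREATE TABLE SOCIALNETWORK_OF_COUNTRY (
  country_id INTEGER REFERENCES COUNTRY(id),
  socialnetwork_id INTEGER REFERENCES SOCIALNETWORK(id)
);
-- Invented sample data.
INSERT INTO COUNTRY VALUES (1, 'Atlantis'), (2, 'Utopia');
INSERT INTO DEBT_OF_COUNTRY VALUES (1, 2010, 400.0), (1, 2011, 500.0), (2, 2011, 90.0);
INSERT INTO SOCIALNETWORK VALUES (1, 'Facebook');
INSERT INTO SOCIALNETWORK_OF_COUNTRY VALUES (1, 1), (2, 1);
""")

# Walk the whole model: country, its network, and its debt for one year.
rows = conn.execute("""
SELECT c.name, s.name, d.amount
FROM COUNTRY AS c
JOIN DEBT_OF_COUNTRY AS d ON d.country_id = c.id
JOIN SOCIALNETWORK_OF_COUNTRY AS sc ON sc.country_id = c.id
JOIN SOCIALNETWORK AS s ON s.id = sc.socialnetwork_id
WHERE d.year = 2011
ORDER BY c.name
""").fetchall()

for row in rows:
    print(row)
```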
E-R Rules
• Entities and Attributes
  – Entities are definitions of things that have some “integrity”
  – Attributes are like properties of things
  – The difference can be logical or practical
• Relations and Cardinality
  – Relations exist between Entities
  – They are like assertions, e.g. PERSON READS BOOK
  – Relations have “cardinality” which gives clues about the data model
• Uniqueness and keys
  – Entities are uniquely defined by certain attributes
Mapping ER Diagrams to Tables
Cardinality matters:
1:1  Same table, with exceptions
1:M  Two tables, table A has key
M:1  Two tables, table B has foreign key
M:M  Third table of foreign keys