5. The Value of Data
… is still being decided
Book Value: $13 billion
Market Value: $114 billion
= 1.3 billion MAUs
~500 terabytes of data added… per day
$101 billion in data
@BlairReeves
6. A Short History of Data
300 B.C.
Great Library of Alexandria (Egypt)
970 A.D.
Al-Azhar University (Egypt)
1400
Cambridge University owns 122 books
1450s
Invention of the Gutenberg printing press
1520s
Martin Luther translates the Bible into German, accelerating mass literacy
1710
Copyright law is born
1770s
Press freedom guarantees; pamphleteering
1890
Herman Hollerith invents machine-readable data for the U.S. Census
1969
ARPANET: first nodes connected (TCP/IP adopted in 1983)
2013
Watson
~2.8 billion global internet users
(40% of world’s population)
7. The Way We Use Data Will Change
Trade Exactitude for Size
Why Sample?
Correlation Over Causality
8. 1 – Trade Exactitude for Size
Precision < Size
More data > Better algorithms
9. 1 – Trade Exactitude for Size
1954
250 word pairs
1990
3 million word pairs
2006
>100 billion word pairs (and counting)
10. 2 – Why Sample?
• Sampling relies on randomness
• Difficult to drill down into subcategories
• Requires careful pre-planning
The term “big data” has no strict definition. It simply refers to the process (or capability) of analyzing datasets so large that they previously could not fit into computer memory. This is where Google MapReduce and Hadoop came from. The technology companies that pioneered these techniques were able to extract new value from huge troves of data that many “offline” companies across a wide range of sectors had kept for years.
Today, up to a third of Amazon’s online revenue is derived from its personalization and recommendations engine.
Case studies
You can cite any number of case studies about how innovative companies have extracted new value from large, previously unremarkable datasets. In each of these cases, data has become the newest natural resource, and it is being exploited to create new markets.
Interestingly, guess how many companies have a line item on their balance sheets for “data”? None. Facebook is one of the best examples of this mismatch between traditional systems of financial value and new ones. Intangible assets made up 40% of the value of public companies in the 1980s and 75% of their value in the 2010s.
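The arithmetic behind the “$101 billion in data” figure can be sketched as follows. This is only an illustration using the numbers quoted on slide 5; the variable names are made up here, and treating the entire gap between market value and book value as “data” is the slide’s own simplifying assumption.

```python
# Figures quoted on slide 5 (billions of US dollars, billions of users).
book_value_bn = 13
market_value_bn = 114
monthly_active_users_bn = 1.3

# Value the market assigns beyond what the balance sheet records.
implied_intangible_bn = market_value_bn - book_value_bn

# Share of market value that book value does not capture.
intangible_share = implied_intangible_bn / market_value_bn

# Billions of dollars divided by billions of users gives dollars per user.
value_per_user_usd = implied_intangible_bn / monthly_active_users_bn

print(f"Implied intangible value: ${implied_intangible_bn} billion")
print(f"Share of market value: {intangible_share:.0%}")
print(f"Per monthly active user: ${value_per_user_usd:.2f}")
```

On these figures, roughly 89% of the market value sits outside the balance sheet, or about $78 per monthly active user.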
As human societies consume, generate and process more data, our political, legal and conceptual models must change along with them. While it took hundreds of years for mass literacy and printed information to change Western civilization, we are now living in an era in which the amount of data, and access to it, are completely unprecedented. It will change how we think about the nature of information itself.
8 million books were printed from 1453 to 1503.
Hollerith shrank tabulating times for the U.S. Census from 8 years to less than 1.
Interestingly, guess how many of the companies listed here have a line item on their balance sheets for “data”? None.
Collecting more data, more often, frequently means sacrificing some level of precision. At large scale, accepting some noise (messiness) in exchange for a larger dataset can mean better predictive power.
NoSQL
IBM 701 machine: punch card system. Translated 60 sentences smoothly.
IBM Candide: ten years’ worth of Canadian parliamentary transcripts. Ultimately it was difficult to scale due to a lack of additional data.
Google Translate: uses billions of websites and a book-scanning project. In 2013, it covers more than 60 languages.
Whether we sample is sometimes treated as a defining characteristic of “big data”: are we querying an entire dataset rather than a select part of it? Sampling is still very useful sometimes, but always as a second-best alternative to querying an entire dataset. It is an artifact of a data-constrained environment in which storage and processing power were sharply limited.
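To illustrate why samples are hard to drill down into, here is a minimal sketch on synthetic data. The dataset size, the rare-subcategory rate, the 1% sample and the helper names are all assumptions chosen for the example, not figures from the talk.

```python
import random

def build_records(total=100_000, rare_every=1_000):
    """Synthetic records: every `rare_every`-th one is in a rare subcategory."""
    return ["rare" if i % rare_every == 0 else "common" for i in range(total)]

def count_category(records, category):
    """Count how many records carry the given category label."""
    return sum(1 for r in records if r == category)

records = build_records()

# 1% simple random sample; the seed is fixed only to make the run repeatable.
rng = random.Random(42)
sample = rng.sample(records, k=len(records) // 100)

full_rare = count_category(records, "rare")
sample_rare = count_category(sample, "rare")

print(f"Rare records in full dataset: {full_rare}")
print(f"Rare records in 1% sample:    {sample_rare}")
```

The full dataset holds 100 rare records, while a 1% sample is expected to contain only about one, which is far too few to analyze that subcategory on its own.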
Up to a third of all Amazon’s sales result from its recommendation and personalization engines. These product-to-product correlations matter far more than understanding WHY customers who buy one product like another.
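A product-to-product signal of this kind can be sketched with simple co-purchase counting. This is a toy stand-in, not Amazon’s actual method; the order data and the function names `co_purchase_counts` and `recommend` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def co_purchase_counts(orders):
    """Count how often each unordered pair of products shares an order."""
    counts = Counter()
    for order in orders:
        for a, b in combinations(sorted(set(order)), 2):
            counts[(a, b)] += 1
    return counts

def recommend(product, counts, top_n=3):
    """Rank other products by how often they were bought with `product`."""
    scores = Counter()
    for (a, b), n in counts.items():
        if a == product:
            scores[b] += n
        elif b == product:
            scores[a] += n
    return [item for item, _ in scores.most_common(top_n)]

# Hypothetical order history.
orders = [
    ["tent", "sleeping bag", "lantern"],
    ["tent", "sleeping bag"],
    ["tent", "stove"],
    ["novel", "bookmark"],
]

counts = co_purchase_counts(orders)
print(recommend("tent", counts))
```

The ranking uses only what was bought together. Nothing in it models why tent buyers also want sleeping bags, which is the point of favoring correlation over causality.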