13. People don’t always agree with the rules
of the game. For example:
Super Bowl XL
Scott Steinmann
14. A Quiz for you…
On the next slide, I want you to tell
me what these four types of data
have in common
Raise your hand when you get the
answer…
(don’t worry, I won’t call on anyone)
15. “A computer would deserve to be called
intelligent if it could deceive a human into
believing that it was human.”
16. Did you get it right?
Alan Turing
The more data types we have,
the harder the classification
17. Classification Cracked The Enigma Code
158,962,555,217,826,360,000
possibilities
Turing used classification of the data to
narrow the problem space:
1st: A letter can never be enciphered as itself
2nd: Known phrases, such as the daily weather report
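The second constraint builds on the first: if you suspect a known phrase (a “crib”) appears in a message, the never-enciphered-as-itself rule eliminates every alignment where the crib and the ciphertext share a letter in the same position. A minimal sketch of that pruning step, with a made-up ciphertext (not real Enigma traffic):

```python
# Sketch of crib-based pruning. Enigma never enciphers a letter as itself,
# so any alignment where the crib and ciphertext agree at some position
# is impossible and can be discarded.

def possible_crib_positions(ciphertext: str, crib: str) -> list[int]:
    """Return the offsets at which the crib could align with the ciphertext."""
    positions = []
    for offset in range(len(ciphertext) - len(crib) + 1):
        window = ciphertext[offset:offset + len(crib)]
        # Reject any alignment that would encipher a letter as itself.
        if all(c != p for c, p in zip(window, crib)):
            positions.append(offset)
    return positions

ciphertext = "QFZWRWIVTYRESXBFOGKUHQBAISE"  # illustrative, made-up ciphertext
crib = "WETTERBERICHT"                       # German for "weather report"
print(possible_crib_positions(ciphertext, crib))
```

Each discarded alignment removes whole families of rotor settings from consideration, which is how classification of the data shrank that 158 quintillion down to something the Bombe machines could search.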
Big Data and Classification – why, more than ever, classification and good data architecture are critical to providing confidence in the analytical outcomes of your big data project.
Paul Balas
25+ years of data architecture and data-centric implementations, from sourcing all types of data to mastering, modeling, and sharing it
Implemented numerous content architectures for Fortune 500 companies and for the Kingdom of Saudi Arabia (KAPSARC)
Chair of the Big Data in Denver LinkedIn Group.
He never imagined a world with so much data that information would be obscured simply by the volume and variety of data.
If you can’t categorize your data, you can’t analyze it. If you aren’t performing data profiling on your big data as a first step to your analysis, you’ve already failed.
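What does that first profiling step look like in practice? A minimal sketch using only the standard library — the column names and sample rows are made up for illustration:

```python
# Minimal data-profiling sketch: per-column null counts, distinct counts,
# and inferred value types. Mixed types in one column are an early warning
# that classification is needed before analysis.
from collections import Counter

def profile(rows: list[dict]) -> dict:
    """Build a per-column profile report for a list of records."""
    report = {}
    columns = rows[0].keys() if rows else []
    for col in columns:
        values = [r.get(col) for r in rows]
        types = Counter(type(v).__name__ for v in values if v is not None)
        report[col] = {
            "nulls": sum(v is None for v in values),
            "distinct": len(set(v for v in values if v is not None)),
            "types": dict(types),
        }
    return report

rows = [
    {"customer": "ACME", "spend": 120.5},
    {"customer": "ACME", "spend": None},
    {"customer": "Globex", "spend": "n/a"},  # mixed types: a red flag
]
print(profile(rows))
```

Even a crude report like this surfaces the null rates and type conflicts that would otherwise silently corrupt downstream analysis.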
The earliest known system of classification is that of Aristotle, who attempted in the 4th cent. B.C. to group animals according to such criteria as mode of reproduction and possession or lack of red blood. Aristotle's pupil Theophrastus classified plants according to their uses and methods of cultivation. Little interest was shown in classification until the 17th and 18th cent., when botanists and zoologists began to devise the modern scheme of categories. The designation of groups was based almost entirely on superficial anatomical resemblances.
Machine driven classification can assist in human analysis and refinement of classification systems, but as it stands, without human context, machine driven classification is limited.
Well-known classification systems such as GAAP-based accounting or plant taxonomies provide a common language that is widely accepted and therefore trusted. Common classification systems facilitate understanding and knowledge sharing.
A new Big Data phenomenon is the ‘Data Lake’. I like to call it the ‘Data Swamp’, as the information added to the lake is useless until it’s classified. The excitement around Hadoop and other NoSQL technologies is that they allow you to defer classification, cleansing, and standardization until after loading, applying them on the fly, which makes the ingestion process and certain types of analytical workloads much faster.
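This deferred, “schema-on-read” style can be sketched in a few lines — the field names, records, and conversion rules below are illustrative assumptions, not a real pipeline:

```python
# Schema-on-read sketch: raw records land in the "lake" untouched, and
# classification/standardization happens only when the data is read.
import json

RAW_LAKE = [
    '{"device": "sensor-7", "temp_f": "98.6"}',
    '{"device": "sensor-7", "temperature": 37.0, "unit": "C"}',  # different schema
]

def read_with_schema(raw_lines):
    """Parse and standardize records at read time, not at ingest time."""
    for line in raw_lines:
        rec = json.loads(line)
        # Standardize to Celsius regardless of how the record arrived.
        if "temp_f" in rec:
            rec["temp_c"] = round((float(rec.pop("temp_f")) - 32) * 5 / 9, 1)
        elif rec.get("unit") == "C":
            rec["temp_c"] = rec.pop("temperature")
            rec.pop("unit")
        yield rec

print(list(read_with_schema(RAW_LAKE)))
```

Ingestion is fast because nothing is enforced up front — but note that the classification work hasn’t disappeared; it has just moved to read time, which is exactly why an unclassified lake behaves like a swamp.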
Billions of dollars and tens of thousands of person-years of effort have been spent on search technologies, all focused on classifying data on the fly to help people locate precise information. Most of this effort has been driven by the internet search engines and firms trying to capitalize on e-commerce.
Bad categorization of a population completely misleads results and creates controversy.
NASA:
Ninety-seven percent of climate scientists agree that climate-warming trends over the past century are very likely due to human activities, and most of the leading scientific organizations worldwide have issued public statements endorsing this position. The following is a partial list of these organizations, along with links to their published statements and a selection of related resources.
The Wall Street Journal
The Myth of the Climate Change '97%'
What is the origin of the false belief—constantly repeated—that almost all scientists agree about global warming?
By Joseph Bast and Roy Spencer
Ms. Oreskes's definition of consensus covered "man-made" but left out "dangerous"—and scores of articles by prominent scientists such as Richard Lindzen, John Christy, Sherwood Idso and Patrick Michaels, who question the consensus, were excluded. The methodology is also flawed. A study published earlier this year in Nature noted that abstracts of academic papers often contain claims that aren't substantiated in the papers.
According to IBM, in 2015 global data volume was about 8,000 exabytes
Most of it is sensor and social media data
By 2020 some predict a 5x growth to 40,000 exabytes
Even though he had already been ejected from the game, Scott Steinmann continued to argue with the umpire’s call.
Why were there so many controversial calls in Super Bowl XL? Were the rules for each penalty applied fairly? The outcome of that game was hotly debated.
What was easy for those of you who knew the answer, is exponentially difficult for machines. Each data type has to be parsed and a common taxonomy applied as metadata to the data itself, then correlated to find the commonalities in each data source.
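The machine’s version of that task — apply a common taxonomy as metadata, then intersect across sources — can be sketched as follows. The taxonomy, keywords, and sample texts are made-up illustrations:

```python
# Sketch of taxonomy-based tagging and correlation: each source gets
# category tags as metadata, and the commonality across sources is
# the intersection of those tags.
TAXONOMY = {
    "machine": ["turing", "enigma", "computer"],
    "intelligence": ["intelligent", "deceive", "human"],
}

def tag(text: str) -> set[str]:
    """Attach every taxonomy category whose keywords appear in the text."""
    words = text.lower().split()
    return {cat for cat, kws in TAXONOMY.items()
            if any(kw in words for kw in kws)}

sources = {
    "quote": "a computer would deserve to be called intelligent",
    "bio": "turing cracked the enigma code",
}
tags = {name: tag(text) for name, text in sources.items()}
# The commonality across the sources is the intersection of their tags.
common = set.intersection(*tags.values())
print(common)
```

Real systems need parsers per data type (image, audio, text, structured) before any tagging can happen, which is where the exponential difficulty comes from — but the correlation step itself reduces to exactly this kind of set intersection over shared metadata.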
That is about 159 quintillion, if you wanted to know.
Was the chad in favor of Bush or Gore?
The risk of O-ring failure wasn’t correctly classified for the temperatures it would encounter.
Credit default swap risk wasn’t correctly categorized, and risky financial decisions ensued.