BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What Does It All Mean?
Trones14@gmail.com
Database Concepts In Business Intelligence
August 4, 2016
Table of Contents
Introduction
Categorizing Data: How to Think About Your Organization's Data
Defining Big Data
Primer on Machine Learning
NoSQL Database Types
Scaling
Conclusion
References
Introduction
“Data scientists spend anywhere from 50-80% of their time cleaning up data sets in order
to find usable insights” (Lohr, 2014). I can personally attest to the accuracy of this
statement. I recently built a “Tableau Jobs” visualization. Building the visualization took
about 30 minutes, but writing the script to make bulk API calls and correctly save the
JSON responses to CSV took hours.
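The shape of that script can be sketched in a few lines. Everything here is illustrative: the endpoint URL, the field names, and the response layout are invented stand-ins, since the actual job-board API is not named in this paper.

```python
import csv
import json
import urllib.request

# Hypothetical endpoint -- the real API behind the "Tableau Jobs"
# visualization is not named in the paper.
API_URL = "https://example.com/api/jobs?page={page}"

def fetch_page(page):
    """Fetch one page of results and parse the JSON body."""
    with urllib.request.urlopen(API_URL.format(page=page)) as resp:
        return json.loads(resp.read().decode("utf-8"))

def save_jobs_to_csv(pages, path="jobs.csv"):
    """Flatten semi-structured JSON records into a flat CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "company", "city"])
        writer.writeheader()
        for page in pages:
            for job in page.get("results", []):
                # Missing keys default to "" -- one of the many small
                # cleaning decisions that consume a data scientist's time.
                writer.writerow({k: job.get(k, "") for k in writer.fieldnames})
```

The fetching is one line; the hours go into decisions like the `job.get(k, "")` fallback, which is exactly the "janitor work" Lohr describes.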
A number of developments needed to occur for machine learning to become a reality.
Machine learning (ML) has commercial applications for two reasons:
1. Plenty of publicly usable data (stored in less than structured formats) to feed into
the ML models
2. Computing advances that allow these models to be trained in a relatively short
period of time (days or weeks)
This paper:
1. Categorizes data based on two characteristics:
o Degree of structure
o Source of the data (internal or external)
2. Explains what is really meant by “big data” in the common vernacular, and the
link between big data, machine learning, and NoSQL storage
3. Introduces machine learning
4. Explains why NoSQL is the storage choice for developers
Categorizing Data: How to Think About Your Organization's Data
Data can be categorized into three groups: structured, semi-structured, and unstructured.
There is one more key distinction: internal or external. Internal data should be structured.
If a company designs the data collection system, then the data can have structure at the
time of generation. Structure at the time of generation is the best-case scenario
because “data scientists spend anywhere from 50-80% of their time cleaning up data sets
(creating structure) in order to find usable insights” (Lohr, 2014).
[Figure: labor hours to insight increase as structure decreases; internal data tends
to be structured, while external data tends to be semi-structured or unstructured.]
Structured data. Process: Visualize!
Semi-structured data. Process: Transform/Clean, Load, Visualize!
Unstructured data. Process: Find sources, Write scripts to extract, Transform/Clean,
Load, Visualize!
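The "Transform/Clean" step that separates semi-structured data from visualization-ready data can be made concrete. The helper below is a generic sketch, not from the paper: it collapses a nested record (the kind an API returns) into the single flat row a tool like Tableau expects.

```python
def flatten(record, parent_key="", sep="."):
    """Collapse a nested (semi-structured) record into one flat row --
    the 'Transform/Clean' step before loading and visualizing."""
    row = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, joining keys with the separator.
            row.update(flatten(value, new_key, sep=sep))
        else:
            row[new_key] = value
    return row

# flatten({"title": "Analyst", "location": {"city": "Chicago"}})
# -> {"title": "Analyst", "location.city": "Chicago"}
```

Structured data skips this step entirely, which is why its process reduces to "Visualize!" alone.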
Defining Big Data
According to Jonathan Ward and Adam Barker of the University of St. Andrews, “all
definitions (of big data) make at least one of the following assertions:
Size: the volume of the datasets is a critical factor.
Complexity: the structure, behavior and permutations of the datasets is a critical
factor.
Technologies: the tools and techniques which are used to process a sizable or
complex dataset is a critical factor.” (Barker & Ward, 2013, Undefined by Data)
Barker and Ward then propose a definition: “Big data is a term describing the storage
and analysis of large and or complex data sets using a series of techniques including, but
not limited to: NoSQL, MapReduce and machine learning.” (Barker & Ward, 2013)
Let’s dive into some reasons why size, complexity, and technologies are all defining
features of big data:
1. Size: The choice of storage, cleaning/transformation, and analysis tools
depend on size:
a. Small data is not a concern.
i. Storage & Analysis: The computational power required to
handle such small datasets is easily achieved with personal
computers; there is no need to scale a job across many
compute clusters if doing so doesn't save time. Easy analysis
means that storage choice is not a concern.
ii. Cleaning Example: It may be quicker to hand-clean data in
Excel using simple find-and-replace operations rather than
writing a script.
2. Complexity: Big structured data is not the issue. With a relational
structure, we can use SQL to easily find what we want. This data is
formatted according to the specifications of the database and needs few
modifications before it is ready to be analyzed. It is usually not as big
as semi-structured or unstructured data because it is normalized; there
are no redundancies. This means it is usually computationally easy to
analyze without having to scale horizontally (adding compute
clusters).
3. Technologies: This is how we store big data (NoSQL) and how we
analyze it (Machine Learning).
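The complexity point above can be made concrete. Once data has a relational structure, SQL really does let us "easily find what we want" in a single statement. A minimal sketch using Python's built-in sqlite3 module and invented sample rows:

```python
import sqlite3

# In-memory relational store with invented sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (title TEXT, city TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    [("Analyst", "Chicago", 70000),
     ("Data Scientist", "Chicago", 95000),
     ("Analyst", "Austin", 65000)],
)

# One declarative query answers the question; no cleaning script needed.
avg_by_city = conn.execute(
    "SELECT city, AVG(salary) FROM jobs GROUP BY city ORDER BY city"
).fetchall()
# avg_by_city -> [('Austin', 65000.0), ('Chicago', 82500.0)]
```

Contrast this with semi-structured or unstructured data, where the same question first requires extraction and transformation scripts before any query can run.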
Our diagram has been narrowed down.
When people speak of big data, they are
usually talking about external semi-
structured or unstructured data. It is this
type of data that can be used by anyone for
machine learning models.
Primer on Machine Learning
Machine learning is perhaps the most
misleading buzzword ever created. What's
the difference between machine learning
and data science or statistics? Why are
machine learning and Big Data gaining
popularity at the same time? What is the
relationship between the two?
One common way to categorize machine learning (ML) is into supervised ML and
unsupervised ML. When I first began diving into the tools and algorithms of machine
learning, they seemed quite similar to predictive and descriptive statistics.
1. Supervised ML breaks the data into two sets: train and test. The model is
built/trained on the train set, and then accuracy of the model is tested on
the test set. We are interested in how well the model predicts the actual
values found in the test set.
2. Unsupervised ML deals with finding hidden structure in data without
giving the model any output goal. So what’s the difference between this
and descriptive statistics?
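The supervised workflow in point 1 can be sketched end to end without any ML library. The model below is a deliberately tiny stand-in (a 1-nearest-neighbor classifier on a single numeric feature); the split ratio and seed are arbitrary choices, not part of the paper.

```python
import random

def train_test_split(rows, test_frac=0.25, seed=42):
    """Supervised ML step 1: hold out a test set the model never sees."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]

def predict_1nn(train, x):
    """A minimal supervised model: predict the label of the single
    nearest training point (1-nearest-neighbor)."""
    return min(train, key=lambda row: abs(row[0] - x))[1]

def accuracy(train, test):
    """Supervised ML step 2: compare predictions against the actual
    labels found in the held-out test set."""
    hits = sum(predict_1nn(train, x) == label for x, label in test)
    return hits / len(test)
```

Swapping `predict_1nn` for a regression or a neural network changes nothing about the train/test discipline, which is the part that distinguishes supervised ML from simply fitting a model to all the data.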
Aatash Shah of Edvancer.in gives us some insight:
“Robert Tibshirani, a statistician and machine learning expert at Stanford,
calls machine learning ‘glorified statistics’… …Both machine learning and
statistics share the same goal: Learning from data. Both these methods focus on
drawing knowledge or insights from the data… … Cheap computing power and
availability of large amounts of data allowed data scientists to train computers to
learn by analyzing data. But, statistical modeling existed long before computers
were invented.” (Shah, 2016, Edvancer.in)
Going back to our original question: what is the relationship between machine learning
and big data? A number of developments needed to occur for machine
learning to become a reality. Without the data explosion caused by the internet, the
development of NoSQL databases, and the computing advances achieved through
Moore's law, GPGPUs, and horizontal scaling of compute clusters, machine learning
would be restricted to the academic realm, impractical for the majority of commercial
purposes.
This brings us to the linchpin of the entire discussion: the external data out there on
the internet is stored in the format that is best for the application developer, namely
NoSQL.
NoSQL Database Types
[Figure: overview of NoSQL database types. (Habib, 2015, Appdynamics.com)]
Scaling
“Achieving scalability and elasticity is a huge challenge for relational databases.
Relational databases were designed in a period when data could be kept small, neat, and
orderly.” (Allen, 2015, Marklogic.com) Relational databases are designed with the data
in mind: the schema avoids duplication by normalizing the data through the relational
structure. Imposing a relational structure at development time severely limits the software
developers' flexibility for future versions of their application. The popularity of iterative,
agile-style software development life cycles (SDLCs) only exacerbates the disadvantages
of RDBMSs.
[Figure: comparison of a relational data model with a NoSQL document data model.
(Allen, 2014, Marklogic.com)]
Pay particular attention to the data model. Remember my story about pulling JSON and
transforming it? That data was likely pulled from a document database. If the site owner
decided to make a major change to the data that is included, it would be a simple
change in their document database. If they were using a relational structure, they
might have to totally redesign the entire schema. As a job search board,
Indeed.com has to scale its storage and compute power up and down based on
web traffic and the number of job postings. Scaling back down is virtually impossible
with a relational structure.
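The flexibility argument can be shown in miniature. Below, two Python dicts stand in for documents in one collection of a hypothetical job-board application; the field names and the query helper are invented for illustration. A newer document carries fields the older one lacks, and nothing breaks, whereas a relational table would need a schema migration first.

```python
# Two "documents" in the same collection. A later release of the
# hypothetical application added salary and tags fields; the old
# document needs no migration -- no ALTER TABLE, no redesign.
jobs = [
    {"_id": 1, "title": "Analyst", "city": "Chicago"},
    {"_id": 2, "title": "Data Scientist", "city": "Austin",
     "salary": 95000, "tags": ["python", "sql"]},
]

def find(collection, **criteria):
    """Toy query: match documents on whatever fields they happen to carry.
    Documents missing a queried field simply don't match."""
    return [doc for doc in collection
            if all(doc.get(field) == value for field, value in criteria.items())]
```

This schema-on-read behavior is what buys the developer agility, and it is also why the resulting data reaches the data scientist in a less-than-structured state.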
Conclusion
There are three categories of data: structured, semi-structured, and unstructured.
There are two sources of data: internal and external.
The challenges associated with deriving insights from data apply mostly to
external data that is semi-structured or unstructured.
The term “Big Data” refers to volume, but also encompasses the storage
technologies (NoSQL) and analysis tools (machine learning) because they are
integral to the big data ecosystem.
Big Data is stored in less-structured NoSQL databases for web-developer agility.
Final Statement: Machine learning is becoming democratized due to the availability of
large amounts of less-than-structured data and cheap compute power. Although it would
be easier for data scientists to work with structured data, this will never happen, because
developers need NoSQL databases for business requirements such as agility and
scalability.
References
Barker, A., & Ward, J. S. (2013, September 20). Undefined By Data: A Survey of Big Data Definitions
[Scholarly project]. Retrieved from http://arxiv.org/abs/1309.5821
Habib, O. (2015, September 21). A Newbie Guide to Databases. Retrieved from
https://blog.appdynamics.com/database/a-newbie-guide-to-databases/
Lohr, S. (2014). For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights. Retrieved July 20,
2016, from http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=1
Lopez, K., & D'Antoni, J. (2014). The Modern Data Warehouse--How Big Data Impacts Analytics
Architecture. Business Intelligence Journal, 19(3), 8-15.
Machine Learning Algorithms Image. Retrieved from
https://s3.amazonaws.com/MLMastery/MachineLearningAlgorithms.png?__s=c9sqnpazsd7pmpusegzy
Machine learning frees up data scientists' time, simplifies smart applications - TechRepublic. (2015,
December 14). Retrieved July 20, 2016, from http://www.techrepublic.com/article/machine-learning-frees-up-data-scientists-time-and-simplifies-smart-applications/
Making Sense of NoSQL. (n.d.). Retrieved July 21, 2016, from
http://macc.foxia.com/files/macc/files/macc_mccreary.pdf
Relational Databases Are Not Designed For Scale | MarkLogic. (2015, November 09). Retrieved July
23, 2016, from http://www.marklogic.com/blog/relational-databases-scale/
Shah, A. (2016, August 1). Machine Learning vs. Statistics. Retrieved from
http://www.edvancer.in/machine-learning-vs-statistics/