BEYOND THE
BUZZWORDS
BIG DATA, MACHINE LEARNING –
WHAT DOES IT ALL MEAN?
Trones14@gmail.com
Database Concepts In Business Intelligence
August 4, 2016
Table of Contents
Introduction
Categorizing Data: How to Think About Your Organization's Data
Defining Big Data
Primer on Machine Learning
NoSQL Database Types
Scaling
Conclusion
References
Introduction
“Data scientists spend anywhere from 50-80% of their time cleaning up data sets in order
to find usable insights” (Lohr, 2014, NYT). I can personally attest to the accuracy of this
statement. I recently built a “Tableau Jobs” visualization. Building the visualization took
about 30 minutes, but writing the script that made bulk API calls and saved the results
correctly to CSV took hours.
A number of developments needed to occur for machine learning to become a reality.
Machine learning (ML) has commercial applications today for two main reasons:
1. Plenty of publicly usable data (stored in less-than-structured formats) to feed into
the ML models
2. Computing advances that allow these models to be trained in a relatively short
period of time (days or weeks)
This paper:
1. Categorizes data based on two characteristics:
o Degree of structure
o Source of data (internal or external)
2. Explains what is really meant by “big data” in the common vernacular, including the
link between big data, machine learning, and NoSQL storage
3. Introduces machine learning
4. Explains why NoSQL is the choice for developers
Categorizing Data: How to Think About Your Organization's Data
Data can be categorized into three groups: structured, semi-structured, and unstructured.
There is one more key distinction: internal or external. Internal data should be structured.
If a company designs the data collection system, then the data can have structure at the
time it is generated. Structure at the time of generation is the best-case scenario
because “Data scientists spend anywhere from 50-80% of their time cleaning up data sets
[creating structure] in order to find usable insights” (Lohr, 2014).
[Figure: data categories by source (internal, external) and degree of structure, ordered by labor hours to insight]
Structured. Process: Visualize!
Semi-structured. Process: Transform/Clean, Load, Visualize!
Unstructured. Process: Find sources, Write scripts to extract, Transform/Clean, Load, Visualize!
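To make the Transform/Clean and Load steps above more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the records in RAW, the column names in FIELDS, and the helper functions clean and to_csv are all hypothetical and are not taken from the script described in the introduction.

```python
import csv
import io
import json

# Hypothetical semi-structured records, e.g. as an API might return them.
# Fields are missing or nested inconsistently, which is typical of such data.
RAW = '''[
  {"title": " Data Analyst ", "company": "Acme", "location": {"city": "Denver"}},
  {"title": "BI Developer", "location": {"city": "Austin", "state": "TX"}},
  {"title": "data analyst", "company": "Globex"}
]'''

# Assumed target columns for the structured output.
FIELDS = ["title", "company", "city"]


def clean(record):
    """Transform/Clean: flatten one nested record into fixed columns."""
    location = record.get("location") or {}
    return {
        "title": record.get("title", "").strip().lower(),
        "company": record.get("company", "unknown"),
        "city": location.get("city", "unknown"),
    }


def to_csv(records):
    """Load: write the cleaned rows out as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


rows = [clean(r) for r in json.loads(RAW)]
csv_text = to_csv(rows)
print(csv_text)
```

Once the data is in a flat, consistent shape like this, it can be handed to a visualization tool. Most of the effort sits in the clean step, which is the point of the labor-hours comparison.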
Defining Big Data
According to Jonathan Ward and Adam Barker of the University of St Andrews, “all
definitions [of big data] make at least one of the following assertions:
• Size: the volume of the datasets is a critical factor.
• Complexity: the structure, behavior and permutations of the datasets is a critical
factor.
• Technologies: the tools and techniques which are used to process a sizable or
complex dataset is a critical factor.” (Barker & Ward, 2013, Undefined by Data)
Barker and Ward then propose a definition: “Big data is a term describing the storage
and analysis of large and or complex data sets using a series of techniques including, but
not limited to: NoSQL, MapReduce and machine learning.” (Barker & Ward, 2013)
Let’s dive into some reasons why size, complexity, and technologies are all defining
features of big data:
1. Size: The choice of storage, cleaning/transformation, and analysis tools
depends on size:
a. Small data is not a concern.
i. Storage & Analysis: The computational power required to
handle small datasets is easily achieved with a personal
computer. There is no need to scale a job across many
machines if doing so doesn’t save time. Easy analysis
means that storage choice is not a concern.
ii. Cleaning Example: It may be quicker to hand-clean data in
Excel using simple find-and-replace operations than to
write a script.
2. Complexity: Big structured data is not the issue. With a relational
structure we can use SQL to easily find what we want. This data is
formatted according to the specifications of the database and needs few
modifications before it is ready to be analyzed. It is also usually smaller
than semi-structured or unstructured data because it is normalized, which
removes most redundancy. This means that it is usually computationally easy to
analyze this data without having to scale horizontally (adding more machines
to a compute cluster).
3. Technologies: This is how we store big data (NoSQL) and how we
analyze it (Machine Learning).
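As a small illustration of the Complexity point above, the following sketch uses Python's built-in sqlite3 module to show how a normalized relational structure lets one SQL query answer a question directly. The customers and orders tables and their contents are invented for this example.

```python
import sqlite3

# In-memory database with two normalized tables: each customer name is stored
# once, and orders refer to it by key instead of repeating it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE orders (
        id          INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(id),
        amount      REAL NOT NULL
    );
    INSERT INTO customers (id, name) VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO orders (customer_id, amount)
        VALUES (1, 120.0), (1, 80.0), (2, 50.0);
""")

# One declarative query finds total spend per customer. No cleaning or
# restructuring step is needed first.
totals = conn.execute("""
    SELECT c.name, SUM(o.amount)
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
conn.close()
print(totals)
```

Because the structure was imposed when the data was stored, the analysis step reduces to writing the query.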
Our diagram has been narrowed down. When people speak of big data, they are
usually talking about external semi-structured or unstructured data. It is this
type of data that can be used by anyone for machine learning models.
Primer on Machine Learning
Machine learning is perhaps the most misleading buzzword ever created. What’s
the difference between machine learning and data science or statistics? Why are
machine learning and big data gaining popularity at the same time? What is the
relationship between the two?
One common way to categorize machine learning (ML) is into supervised ML and
unsupervised ML. When I first began diving into the tools and algorithms of machine
learning, they seemed quite similar to predictive and descriptive statistics.
1. Supervised ML learns from examples whose correct output is already known.
The data is typically broken into two sets: train and test. The model is
built/trained on the train set, and then the accuracy of the model is tested on
the test set. We are interested in how well the model predicts the actual
values found in the test set.
2. Unsupervised ML deals with finding hidden structure in data without
giving the model any output goal. So what’s the difference between this
and descriptive statistics?
Aatash Shah of Edvancer.in gives us some insight:
“Robert Tibshirani, a statistician and machine learning expert at Stanford,
calls machine learning ‘glorified statistics’… …Both machine learning and
statistics share the same goal: Learning from data. Both these methods focus on
drawing knowledge or insights from the data… … Cheap computing power and
availability of large amounts of data allowed data scientists to train computers to
learn by analyzing data. But, statistical modeling existed long before computers
were invented.” (Shah, 2016, Edvancer.in)
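The supervised workflow described earlier (train on one set, test on another) can be sketched in a few lines of plain Python. This is a minimal illustration, not a recommended toolchain: the data is synthetic, the model is an ordinary least-squares line (a classic statistical method, which fits the "glorified statistics" remark), and the names fit_line and mean_abs_error are made up for this example.

```python
import random


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, points):
    """Average absolute gap between predicted and actual y values."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic data (an assumption for illustration): y is roughly 3x + 2 plus noise.
data = [(x, 3 * x + 2 + rng.uniform(-1, 1)) for x in range(100)]
rng.shuffle(data)

# Hold out 20% of the examples; the model never sees them during training.
train, test = data[:80], data[80:]
model = fit_line(train)
test_error = mean_abs_error(model, test)
print(model, test_error)
```

The number that matters is the error on the held-out test set, because it estimates how well the model predicts values it was not trained on.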
Going back to our original question: what is the relationship between machine learning
and big data? A number of developments needed to occur for machine
learning to become a reality. Without the data explosion caused by the internet, the
development of NoSQL databases, and the computing advances achieved through
Moore’s law, GPGPUs, and horizontal scaling of compute clusters, machine learning
would be restricted to the academic realm and impractical for the majority of commercial
purposes.
This brings us to the linchpin of the entire discussion. The external data out there on
the internet is stored in the format that is best for the application developer:
NoSQL.
NoSQL Database Types
[Figure: NoSQL database types (Habib, 2015, Appdynamics.com)]
Scaling
“Achieving scalability and elasticity is a huge challenge for relational databases.
Relational databases were designed in a period when data could be kept small, neat, and
orderly.” (Allen, 2015, Marklogic.com) Relational databases are designed with the data
in mind: the relational structure normalizes the data to avoid duplication.
Imposing a relational structure at development time severely limits the software
developers’ flexibility in future versions of their application. The popularity of iterative,
agile-like software development life cycles (SDLC) only exacerbates the disadvantages of
RDBMSs.
[Comparison table: Relational vs. NoSQL (Document)] (Allen, 2015, Marklogic.com)
Pay particular attention to the Data Model. Remember my story about pulling JSON and
transforming it? This data was likely pulled from a document database. If the site owner
decided to make a major change to the data that was included, this would be a simple
change in their document database. If they were using a relational structure, they
might have to redesign much of the schema. As a job search board,
Indeed.com has to scale its storage and compute power up and down based on
web traffic and the number of job postings. Scaling out and back in this way is much
harder with a relational structure.
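The difference in data models can be sketched with Python's standard library alone. This is a toy comparison under stated assumptions: the job postings, the new "remote" field, and the table layout are invented, a list of dictionaries stands in for a document database, and SQLite stands in for an RDBMS.

```python
import sqlite3

# Document-style storage: each posting is a self-describing record, so a new
# field ("remote") can appear on new postings without touching the old ones.
postings = [
    {"title": "Data Analyst", "city": "Denver"},
    {"title": "BI Developer", "city": "Austin", "remote": True},
]
remote_titles = [p["title"] for p in postings if p.get("remote")]

# Relational storage: the columns are fixed when the table is created.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE postings (title TEXT, city TEXT)")
conn.execute("INSERT INTO postings VALUES ('Data Analyst', 'Denver')")

new_row = "INSERT INTO postings (title, city, remote) VALUES ('BI Developer', 'Austin', 1)"
try:
    conn.execute(new_row)
    insert_failed = False
except sqlite3.OperationalError:
    # The table has no "remote" column yet, so the insert is rejected.
    insert_failed = True

# A schema change has to happen before the new field can be stored.
conn.execute("ALTER TABLE postings ADD COLUMN remote INTEGER")
conn.execute(new_row)
rows = conn.execute("SELECT title, remote FROM postings ORDER BY title").fetchall()
conn.close()
print(remote_titles, insert_failed, rows)
```

In a table this small the schema change is a single statement. The cost the paper refers to grows with the number of tables, relationships, and application code paths that depend on the old structure.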
Conclusion
• There are three categories of data: structured, semi-structured, and unstructured.
• There are two sources of data: internal and external.
• The challenges associated with deriving insights from data apply mainly to external
data that is semi-structured or unstructured.
• The term “Big Data” refers to volume, but also encompasses the storage
technologies (NoSQL) and analysis tools (machine learning) because they are
integral to the big data ecosystem.
• Big Data is stored in less-structured NoSQL DBs for web developer agility.
Final Statement: Machine learning is becoming democratized due to the availability of
large amounts of less-than-structured data and cheap compute power. Although it would
be easier for data scientists to work with structured data, this is unlikely to happen because
developers need NoSQL databases to meet business requirements such as agility and
scalability.
References
Barker, A., & Ward, J. S. (2013, September 20). Undefined By Data: A Survey of Big Data Definitions
[Scholarly project]. Retrieved from http://arxiv.org/abs/1309.5821
Habib, O. (2015, September 21). A Newbie Guide to Databases. Retrieved from
https://blog.appdynamics.com/database/a-newbie-guide-to-databases/
Lohr, S. (2014). For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights. Retrieved July 20, 2016, from
http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=1
Lopez, K., & D'Antoni, J. (2014). The Modern Data Warehouse: How Big Data Impacts Analytics
Architecture. Business Intelligence Journal, 19(3), 8-15.
Machine Learning Algorithms Image. Retrieved from
https://s3.amazonaws.com/MLMastery/MachineLearningAlgorithms.png?__s=c9sqnpazsd7pmpusegzy
Machine learning frees up data scientists' time, simplifies smart applications - TechRepublic. (2015,
December 14). Retrieved July 20, 2016, from
http://www.techrepublic.com/article/machine-learning-frees-up-data-scientists-time-and-simplifies-smart-applications/
Making Sense of NoSQL. (n.d.). Retrieved July 21, 2016, from
http://macc.foxia.com/files/macc/files/macc_mccreary.pdf
Relational Databases Are Not Designed For Scale | MarkLogic. (2015, November 09). Retrieved July
23, 2016, from http://www.marklogic.com/blog/relational-databases-scale/
Shah, A. (2016, August 1). Machine Learning vs. Statistics. Retrieved from
http://www.edvancer.in/machine-learning-vs-statistics/

RA-11058_IRR-COMPRESS Do 198 series of 1998
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 

BIAM 410 Final Paper - Beyond the Buzzwords: Big Data, Machine Learning, What Does It All Mean?

  • 1. BEYOND THE BUZZWORDS BIG DATA, MACHINE LEARNING – WHAT DOES IT ALL MEAN? Trones14@gmail.com Database Concepts In Business Intelligence August 4, 2016
  • 2. Table of Contents: Introduction (1); Categorizing Data: How to Think About Your Organization's Data (2); Defining Big Data (3); Primer on Machine Learning (5); NoSQL Database Types (7); Scaling (7); Conclusion (9); References (10)
  • 3. PAGE 1 Introduction “Data scientists spend anywhere from 50-80% of their time cleaning up data sets in order to find usable insights” (Lohr, 2014, NYT). I can personally attest to the accuracy of this statement. I recently built a “Tableau Jobs” visualization. The visualization itself took about 30 minutes, but writing the script that made bulk API calls and saved the results correctly to CSV took hours. A number of developments needed to occur for machine learning (ML) to become a reality. ML has commercial applications for two reasons: 1. Plenty of publicly usable data (stored in less-than-structured formats) is available to feed into ML models. 2. Computing advances allow these models to be trained in a relatively short period of time (days or weeks). This paper: 1. Categorizes data by two characteristics: (a) degree of structure and (b) source of data (internal or external). 2. Explains what is really meant by “big data” in the common vernacular, including the link between big data and machine learning and the role of NoSQL storage. 3. Introduces machine learning. 4. Explains why NoSQL is the choice for developers.
  • 4. PAGE 2 Categorizing Data: How to Think About Your Organization's Data Data can be categorized into three groups: structured, semi-structured, and unstructured. There is one more key distinction: internal or external. Internal data should be structured. If a company designs the data collection system, the data can be given structure at the time it is generated. Structure at the time of generation is the best-case scenario because “Data scientists spend anywhere from 50-80% of their time cleaning up data sets (creating structure) in order to find usable insights” (Lohr, 2014). [Figure: data arranged by source (Internal, External) and degree of structure (Structured, Semi-Structured, Unstructured) against labor hours to insight. Three processes are shown, in order of increasing effort: (1) Visualize; (2) Transform/Clean, Load, Visualize; (3) Find sources, Write scripts to extract, Transform/Clean, Load, Visualize.]
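The Transform/Clean and Load steps named above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the JSON records, their field names (`title`, `company`, `skills`), and the `flatten` helper are hypothetical stand-ins for whatever an external source actually returns, not the script used for the “Tableau Jobs” visualization.

```python
import csv
import io
import json

# Hypothetical semi-structured input: fields are nested and sometimes missing.
raw = '''[
  {"title": "Data Analyst", "company": {"name": "Acme"}, "skills": ["SQL", "Tableau"]},
  {"title": "BI Developer", "company": {"name": "Globex"}}
]'''


def flatten(record):
    """Impose a fixed, flat structure on one nested record (Transform/Clean)."""
    return {
        "title": record.get("title", ""),
        "company": record.get("company", {}).get("name", ""),
        # Join the optional list into one cell; missing lists become "".
        "skills": ";".join(record.get("skills", [])),
    }


rows = [flatten(r) for r in json.loads(raw)]

# Load: write the structured rows as CSV (to an in-memory buffer here).
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["title", "company", "skills"], lineterminator="\n"
)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Most of the effort in a script like this goes into deciding what to do with missing or nested fields, which is the “creating structure” work the quotation refers to.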
  • 5. PAGE 3 Defining Big Data According to Jonathan Ward & Adam Barker of the University of St. Andrews, “all definitions (of big data) make at least one of the following assertions: Size: the volume of the datasets is a critical factor. Complexity: the structure, behavior and permutations of the datasets is a critical factor. Technologies: the tools and techniques which are used to process a sizable or complex dataset is a critical factor.” (Barker & Ward, 2013, Undefined by Data) Barker and Ward then propose a definition: “Big data is a term describing the storage and analysis of large and or complex data sets using a series of techniques including, but not limited to: NoSQL, MapReduce and machine learning.” (Barker & Ward, 2013) Let’s dive into some reasons why size, complexity, and technologies are all defining features of big data: 1. Size: The choice of storage, cleaning/transformation, and analysis tools depends on size: a. Small data is not a concern. i. Storage & Analysis: The computational power required to handle small datasets is easily supplied by a personal computer; there is no need to scale a job across many compute clusters if doing so doesn’t save time. Because analysis is easy, storage choice is not a concern.
  • 6. PAGE 4 ii. Cleaning Example: It may be quicker to hand-clean data in Excel using simple find-and-replace operations than to write a script. 2. Complexity: Big structured data is not the issue. With a relational structure we can use SQL to easily find what we want. This data is formatted according to the specifications of the database and needs few modifications before it is ready to be analyzed. It is usually not as big as semi-structured or unstructured data because it is normalized, so there are no redundancies. This means that it is usually computationally easy to analyze without having to scale horizontally (adding compute clusters). 3. Technologies: This is how we store big data (NoSQL) and how we analyze it (machine learning). Our diagram has been narrowed down. When people speak of big data, they are usually talking about external semi-structured or unstructured data. It is this type of data that can be used by anyone for machine learning models.
  • 7. PAGE 5 Primer on Machine Learning Machine learning is perhaps the most misleading buzzword ever created. What’s the difference between machine learning and data science or statistics? Why are machine learning and Big Data gaining popularity at the same time? What is the relationship between the two? One common way to categorize machine learning (ML) is into supervised ML and unsupervised ML. When I first began diving into the tools and algorithms of machine learning, they seemed quite similar to predictive and descriptive statistics. 1. Supervised ML learns from examples whose correct outputs are known. The data is typically broken into two sets: train and test. The model is built/trained on the train set, and then the accuracy of the model is tested on the test set. We are interested in how well the model predicts the actual values found in the test set. 2. Unsupervised ML deals with finding hidden structure in data without giving the model any output goal. So what’s the difference between this and descriptive statistics?
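The train/test workflow described for supervised ML can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data is synthetic (roughly y = 2x + 1 with noise), the model is an ordinary least-squares line, and the helper names `train_test_split`, `fit_line`, and `mean_absolute_error` are hypothetical rather than taken from any particular library.

```python
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a fraction of them as the test set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(train):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(model, test):
    """Average gap between predictions and the actual held-out values."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in test) / len(test)


# Synthetic labeled data: the "correct output" y is known for every x.
noise = random.Random(42)
data = [(x, 2 * x + 1 + noise.uniform(-0.5, 0.5)) for x in range(40)]

train, test = train_test_split(data)
model = fit_line(train)              # built only from the train set
error = mean_absolute_error(model, test)  # judged only on the test set
print(f"slope={model[0]:.2f} intercept={model[1]:.2f} test MAE={error:.2f}")
```

The point of the split is that the error is measured on rows the model never saw, which is what separates this workflow from simply describing the data already in hand.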
  • 8. PAGE 6 Aatash Shah of Edvancer.in gives us some insight: “Robert Tibshirani, a statistician and machine learning expert at Stanford, calls machine learning ‘glorified statistics’ … Both machine learning and statistics share the same goal: Learning from data. Both these methods focus on drawing knowledge or insights from the data … Cheap computing power and availability of large amounts of data allowed data scientists to train computers to learn by analyzing data. But, statistical modeling existed long before computers were invented.” (Shah, 2016, Edvancer.in) Going back to our original question: what is the relationship between machine learning and big data? A number of developments needed to occur for machine learning to become a reality. Without the data explosion caused by the internet, the development of NoSQL databases, and the computing advances achieved through Moore’s law, GPGPUs, and horizontal scaling of compute clusters, machine learning would be restricted to the academic realm and impractical for the majority of commercial purposes. This brings us to the linchpin of the entire discussion: the external data out there on the internet is stored in the format that is best for the application developer, NoSQL.
  • 9. PAGE 7 NoSQL Database Types [Figure: NoSQL database types (Habib, 2015, Appdynamics.com)] Scaling “Achieving scalability and elasticity is a huge challenge for relational databases. Relational databases were designed in a period when data could be kept small, neat, and orderly.” (Allen, 2015, Marklogic.com) Relational databases are designed with the data in mind: the relational structure normalizes the data to avoid duplication. Imposing a relational structure at development time severely limits the software developers’ flexibility in future versions of their application. The popularity of iterative
  • 10. PAGE 8 agile-like software development life cycles (SDLCs) only exacerbates the disadvantages of RDBMSs. [Figure: comparison of Relational and NoSQL (Document) databases (Allen, 2015, Marklogic.com)] Pay particular attention to the Data Model. Remember my story about pulling JSON and transforming it? This data was likely pulled from a document database. If the site owner decided to make a major change to the data that was included, this would be a simple change in their document database. If they were using a relational structure, they might have to redesign the entire structure. As a job search board, Indeed.com will have to scale its storage and compute power up and down based on web traffic and the number of job postings. Scaling back down is virtually impossible with a relational structure.
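The data-model contrast can be made concrete with a small sketch. It rests on assumptions: a plain Python list of JSON strings stands in for a document database, an in-memory SQLite table stands in for a relational one, and the job-posting fields (`title`, `city`, `remote`) are hypothetical. Adding one column is also the mildest kind of relational change; the sketch only shows that the schema has to be altered explicitly before new-shaped data fits.

```python
import json
import sqlite3

# Document-style storage: each record carries its own structure.
documents = [json.dumps({"title": "Analyst", "city": "Austin"})]
# A later application version adds a field; existing documents are untouched.
documents.append(json.dumps({"title": "Engineer", "city": "Denver", "remote": True}))
loaded = [json.loads(d) for d in documents]
remote_flags = [doc.get("remote", False) for doc in loaded]

# Relational storage: the columns are fixed when the table is designed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (title TEXT, city TEXT)")
conn.execute("INSERT INTO jobs VALUES (?, ?)", ("Analyst", "Austin"))

new_row = ("Engineer", "Denver", 1)
insert_sql = "INSERT INTO jobs (title, city, remote) VALUES (?, ?, ?)"
try:
    conn.execute(insert_sql, new_row)
    insert_failed = False
except sqlite3.OperationalError:
    # The table has no "remote" column yet, so the insert is rejected.
    insert_failed = True

# The schema must be migrated before the new field can be stored.
conn.execute("ALTER TABLE jobs ADD COLUMN remote INTEGER DEFAULT 0")
conn.execute(insert_sql, new_row)
rows = conn.execute(
    "SELECT title, city, remote FROM jobs ORDER BY title"
).fetchall()
print(remote_flags, insert_failed, rows)
```

In the document case the burden moves to the reader of the data, who has to cope with records of different shapes (the `doc.get("remote", False)` call), which is the cleanup cost discussed earlier in the paper.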
  • 11. PAGE 9 Conclusion • There are three categories of data: structured, semi-structured, and unstructured. • There are two sources of data: internal and external. • The challenges associated with deriving insights from data apply to external data that is semi-structured or unstructured. • The term “Big Data” refers to volume, but also encompasses the storage technologies (NoSQL) and analysis tools (machine learning) because they are integral to the big data ecosystem. • Big Data is stored in less-structured NoSQL databases for web developer agility. Final Statement: Machine learning is becoming democratized due to the availability of large amounts of less-than-structured data and cheap compute power. Although it would be easier for data scientists to work with structured data, this will never happen because developers need to use NoSQL databases for business requirements such as agility and scalability.
  • 12. PAGE 10 References Barker, A., & Ward, J. S. (2013, September 20). Undefined By Data: A Survey of Big Data Definitions [Scholarly project]. Retrieved from http://arxiv.org/abs/1309.5821 Habib, O. (2015, September 21). A Newbie Guide to Databases. Retrieved from https://blog.appdynamics.com/database/a-newbie-guide-to-databases/ Lohr, S. (2014). For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights. Retrieved July 20, 2016, from http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=1 Lopez, K., & D'Antoni, J. (2014). The Modern Data Warehouse--How Big Data Impacts Analytics Architecture. Business Intelligence Journal, 19(3), 8-15. Machine Learning Algorithms Image. Retrieved from https://s3.amazonaws.com/MLMastery/MachineLearningAlgorithms.png?__s=c9sqnpazsd7pmpusegzy Machine learning frees up data scientists' time, simplifies smart applications - TechRepublic. (2015, December 14). Retrieved July 20, 2016, from http://www.techrepublic.com/article/machine-learning-frees-up-data-scientists-time-and-simplifies-smart-applications/ Making Sense of NoSQL. (n.d.). Retrieved July 21, 2016, from http://macc.foxia.com/files/macc/files/macc_mccreary.pdf Relational Databases Are Not Designed For Scale | MarkLogic. (2015, November 09). Retrieved July 23, 2016, from http://www.marklogic.com/blog/relational-databases-scale/ Shah, A. (2016, August 1). Machine Learning vs. Statistics. Retrieved from http://www.edvancer.in/machine-learning-vs-statistics/