Sameer Verma, Ph.D.
Big Data Analytics
Concepts, technologies, and operations
Sameer Verma, Ph.D.
Professor and Chair, Information Systems
Lam Family College of Business
San Francisco State University
San Francisco, CA 94132 USA
https://faculty.sfsu.edu/~sverma
sverma@sfsu.edu
University of the West Indies
Institutional Academic Partner
Centre of Excellence
Mona School of Business & Mgmt
University of the West Indies
Jamaica
Big Data Analytics
➔ Big
➔ Data
➔ Analytics
Big
● Volume
– Size of dataset
● Petabytes (10^15 bytes), Exabytes (10^18 bytes), Zettabytes (10^21 bytes).
● Variety
– Complex
● Structured and unstructured text, audio, video etc.
● Velocity
– Near-real time input, processing and output.
● Veracity
– Questionable quality of input, false discovery rates...
Sample vs Population
● Sampling leads to inferences.
● We sample randomly, or in stratified modes, to work at a smaller, more manageable scale.
● Extrapolate results to the population.
– The p-value is of utmost importance!
● What if we could crunch the entire population?
– No need to sample?
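The contrast between sampling and crunching the whole population can be sketched in a few lines of Python. This is only an illustration: the "population" is synthetic data generated on the spot, and the sizes (100,000 records, a sample of 500) are assumptions chosen to keep it fast.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical "population": 100,000 synthetic order values
population = [random.gauss(50.0, 12.0) for _ in range(100_000)]

# Inferential route: draw a simple random sample and estimate the mean
sample = random.sample(population, 500)
sample_mean = statistics.mean(sample)
# The standard error quantifies the uncertainty introduced by sampling
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

# Descriptive route: compute the mean over every record, no inference needed
population_mean = statistics.mean(population)

print(f"sample estimate: {sample_mean:.2f} +/- {1.96 * standard_error:.2f}")
print(f"population mean: {population_mean:.2f}")
```

The sample gives an estimate with an error margin; the full pass gives the value itself, at the cost of touching every record.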
Data
Nature and Structure of Data
Normalization
● A process of restructuring a relational database
● Applies a series of “normal forms” in order to reduce data redundancy and improve data integrity
● It was first proposed by Edgar F. Codd as an integral part of his relational model.
A Bookstore Example
● Suggested fields for the bookstore:
– Title
– Author
– Author Biography
– ISBN
– Price
– Subject
– Number of Pages
– Publisher
– Publisher Address
– Description
– Review
– Reviewer Name
Single Table
Multiple items
Normalizing once: 1NF
Reduce redundancy across columns. Make the values in each column of a table atomic, i.e. not divisible any further.
• Author
• Bio
• Subject
Component tables
Author
Subject
Publisher
Book
Note: Bio can be a part of the Author table
Author
Subject
Publisher
Book
Relationships
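The component tables and their relationships can be sketched with Python's built-in sqlite3 module. The table and column names below are assumptions based on the bookstore fields, the ISBN and publisher values are placeholders, and the model is simplified to one author per book (a real bookstore would likely need a separate book-author link table).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce the relationships

# Component tables: author and publisher details are stored once each
conn.executescript("""
CREATE TABLE author (
    author_id  INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name  TEXT NOT NULL,
    bio        TEXT
);
CREATE TABLE publisher (
    publisher_id INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    address      TEXT
);
CREATE TABLE book (
    isbn         TEXT PRIMARY KEY,
    title        TEXT NOT NULL,
    author_id    INTEGER NOT NULL REFERENCES author(author_id),
    publisher_id INTEGER NOT NULL REFERENCES publisher(publisher_id)
);
""")

conn.execute("INSERT INTO author VALUES (1, 'Jon', 'Stephens', NULL)")
# Placeholder publisher, not taken from the slides
conn.execute("INSERT INTO publisher VALUES (1, 'Example Press', NULL)")
# Placeholder ISBN, not the real one
conn.execute(
    "INSERT INTO book VALUES (?, ?, ?, ?)",
    ("000-0-00-000000-0", "Beginning MySQL Database Design and Optimization", 1, 1),
)

# The relationships (foreign keys) let a join reassemble the original wide row
row = conn.execute("""
    SELECT b.title, a.first_name || ' ' || a.last_name, p.name
    FROM book AS b
    JOIN author AS a ON a.author_id = b.author_id
    JOIN publisher AS p ON p.publisher_id = b.publisher_id
""").fetchone()
print(row)
```

Each fact lives in exactly one table, and the join pays the cost of putting the pieces back together at query time.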
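NoSQL
● Databases that can store each record whole in a single table or collection
● No SQL-like relationships (joins, foreign keys) enforced by the database
● Suited to clickstream data
– Twitter, Facebook, etc.
● Serialization (denormalization): the reverse of normalization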
JavaScript Object Notation
● JSON or JavaScript Object Notation
{
  "Table1": [
    {
      "id": 0,
      "title": "Beginning MySQL Database Design and Optimization"
    },
    {
      "id": 1,
      "firstname": "Jon"
    },
    {
      "id": 2,
      "lastname": "Stephens"
    }
  ]
}
Title | First Name | Last Name
Beginning MySQL Database Design and Optimization | Jon | Stephens
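As a minimal sketch of how a table row turns into JSON and back, Python's standard json module is enough. The dictionary keys below follow the field names on the slide; packing them into a single object per row is an assumption made for brevity.

```python
import json

# One row of the bookstore table, using the values shown on the slide
row = {
    "title": "Beginning MySQL Database Design and Optimization",
    "firstname": "Jon",
    "lastname": "Stephens",
}

# Serialize: collapse the fields into a single JSON string
document = json.dumps(row)

# Deserialize: recover the individual fields from that string
restored = json.loads(document)

print(document)
print(restored["lastname"])
```

The round trip loses nothing for text and numbers, which is why JSON works well as a storage format for text-like records.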
JSON and BSON
● JSON is for text-like data
● BSON is Binary JSON
– Serialize JSON-like documents as binary!
– Store binary data such as music or video as BSON.
● More detail:
https://en.wikipedia.org/wiki/NoSQL
Analytics
● Descriptive statistics
– Frequency count, mean, variance, etc.
● Not inferring from sample stats.
● Usually applied to population.
● Four stages:
– Measure, Collect, Analyze, Report.
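The descriptive statistics listed above can be computed directly with the Python standard library. The log entries and response times are made-up illustration data. Because the numbers are treated as the whole population, the sketch uses the population variance (`pvariance`) rather than the sample variance.

```python
from collections import Counter
import statistics

# Hypothetical page-view log: one entry per request
page_views = ["home", "cart", "home", "search", "home", "cart"]
# Hypothetical response times in milliseconds for the same requests
response_ms = [120, 340, 95, 210, 130, 305]

# Frequency count per page
frequency = Counter(page_views)

# Mean and population variance of the measured values
mean_ms = statistics.mean(response_ms)
variance_ms = statistics.pvariance(response_ms)

print(frequency.most_common())
print(mean_ms, round(variance_ms, 2))
```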
Descriptive vs Inferential
● Inferential: As sampled and extrapolated. See Cook & Campbell (1979)
– Statistical validity: Validity of correlation.
– Internal validity: Correlation reflects a causal relationship.
– Construct validity: Higher order constructs (independent, dependent variables)
– External validity: Generalization across variations.
● Descriptive: Applies to the entire population, as measured.
Near-real time
● Input is usually near-real time.
– Automated processes.
– System and user logs.
● Processing has to be near-real time.
– Mapped and distributed.
● Output is expected to be near-real time.
– Trends, associations.
SQL vs NoSQL
● SQL
– Large structured data broken into smaller atomic units, connected by relationships.
– Relationships are integral to the DBMS.
– Multiple tables and keys (primary, foreign).
● NoSQL
– Semi-structured and unstructured data, often collapsed into serialized strings or documents.
– Relationships have to be handled outside the DBMS.
– Often a single table or collection (key-value, document or column-oriented). Usually indexed by key.
Column DB
● In this simplified view, a column-oriented store is a table with one value column (and one more, the key, for indexing).
● Collapse (serialize) multiple “fields” into one string.
Title | First Name | Last Name
Beginning MySQL Database Design and Optimization | Jon | Stephens
becomes
{"Table1": [{"id": 0,"title": "Beginning MySQL Database Design and Optimization"},{"id": 1,"firstname": "Jon"},{"id": 2,"lastname": "Stephens"}]}
HBase
● Apache HBase.
– https://en.wikipedia.org/wiki/Apache_HBase
– Column-oriented
– Key-value datastore
Source: https://www.slideshare.net/hortonworks/integration-of-hive-and-hbase-12805463
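To make "column-oriented key-value datastore" concrete, here is a plain-Python stand-in for the HBase data model, in which a value is addressed by row key, column family, column qualifier and timestamp. It is not the HBase client API. The `put` and `get` helpers, the family names and the prices are all hypothetical.

```python
from collections import defaultdict

# table[row_key][(family, qualifier)] -> list of (timestamp, value), newest first
table = defaultdict(lambda: defaultdict(list))

def put(row_key, family, qualifier, value, timestamp):
    """Store one cell version under its row key and column."""
    cells = table[row_key][(family, qualifier)]
    cells.append((timestamp, value))
    cells.sort(key=lambda cell: cell[0], reverse=True)

def get(row_key, family, qualifier):
    """Return the newest value for a cell, or None if it was never written."""
    cells = table[row_key].get((family, qualifier))
    return cells[0][1] if cells else None

# Rows need not share the same columns; each cell is just a keyed value
put("book1", "info", "title", "Beginning MySQL Database Design and Optimization", 1)
put("book1", "author", "lastname", "Stephens", 1)
put("book1", "info", "price", "39.99", 1)  # hypothetical price
put("book1", "info", "price", "34.99", 2)  # a newer version of the same cell

print(get("book1", "info", "price"))
print(get("book1", "info", "pages"))
```

Reads return the most recent version of a cell, and a column that was never written simply does not exist for that row.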
Distributed crunching
● Distributed
– Hadoop Distributed File System (HDFS)
● ZooKeeper
Source: https://www.slideshare.net/hortonworks/integration-of-hive-and-hbase-12805463
MapReduce
● MapReduce
– Map: breaks the data into smaller components, each processed independently.
– Reduce: distills (combines) the output from each computational node.
● Tasks run in unison (in parallel) and continuously.
● Distributes the load across multiple cloud machines.
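The map and reduce steps can be sketched as a word count in plain Python. This runs in a single process. In a real cluster each chunk would be mapped on a different node and the framework would do the grouping ("shuffle") between the two phases. The function names and the input chunks are illustrative assumptions.

```python
from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_chunks):
    """Group the emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Distill all values for one key into a single result."""
    return key, sum(values)

# Each chunk stands in for the slice of data held by one node
chunks = ["big data analytics", "big data", "data"]

mapped = [map_phase(chunk) for chunk in chunks]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

print(counts)
```

Because each `map_phase` call depends only on its own chunk, the calls can run on separate machines without coordination, which is what lets the load be distributed.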
Cloud Computing
Cloud Computing
● Moore’s law
– Cost and size being constant, computing power (“crunch”) doubles every 18 to 24 months.
● Metcalfe’s law
– The utility of a network is proportional to the square of the number of connected computers.
● Both observations describe faster-than-linear growth: Moore’s law is exponential, Metcalfe’s law is quadratic.
● Cloud computing is the confluence of both.
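A small numeric sketch shows how the two laws grow. The function names, the 2-year doubling period and the node counts are assumptions chosen for illustration, and "utility" is left in arbitrary units.

```python
def moore_factor(years, doubling_period_years=2.0):
    """Growth in computing power after `years`, doubling once per period."""
    return 2 ** (years / doubling_period_years)

def metcalfe_utility(nodes):
    """Network utility in arbitrary units, proportional to nodes squared."""
    return nodes ** 2

# Ten years at one doubling every two years is five doublings
print(moore_factor(10, 2.0))

# Doubling the number of connected machines quadruples the utility
print(metcalfe_utility(20) / metcalfe_utility(10))
```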
Cloud Computing
● Infrastructure as a Service (IaaS).
– Amazon, Azure, Google, OpenStack...
● Utility-oriented.
● Pay-as-you-go.
● Challenges: provisioning and scaling of a given architecture.
Orchestration
● Orchestration
– Streamlined provisioning and scaling
– Distilled ops
– Abstracted away from cloud vendors
● API
● Provision on any cloud platform.
● AWS, Azure, Google, OpenStack...
Ubuntu Juju
● Juju
– Canonical: Makers of Ubuntu.
– Open Source
– Application and Service modeling tool
– Deploy, Manage and Scale on any cloud
– Charms - https://charmhub.io
Juju + charms
Juju
http://Charmhub.io
API
LXC
Hadoop HBase via Juju
● Hadoop HBase “charm”
– Fourteen-unit big data cluster
– A distributed big data store with MapReduce
– Runs on 8 machines in your cloud.
Hadoop HBase Architecture
Provisioning on any cloud
juju deploy hbase
https://charmhub.io/hbase
Containers vs VM
● A virtual machine includes its own kernel
● Containers share everything that is the same across installs.
– Share the host kernel
– Account, resource and file system isolation
● BSD jails, chroot, Docker, LXC.
LXC as local cloud
● LXC can run on a laptop
● LXD to manage LXC containers
– https://charmhub.io/lxd
– juju deploy lxd
Kubernetes
● Container orchestration system (originally from Google)
● Containers can be a mix and match of VMs, Docker, etc.
● https://en.wikipedia.org/wiki/Kubernetes
Micro Kubernetes
● A micro installation of Kubernetes
– MicroK8s (pronounced “micro kates”)
– https://microk8s.io/
● Run on your dev machine
– snap install microk8s --classic
● Run on Raspberry Pi
● “Edge” device
Conclusion
● Population data (Volume)
● Unstructured data (Variety)
● Near-real time (Velocity)
● Descriptive stats (Veracity)
● Cloud Computing = crunch + network

Big Data Analytics: Concepts, Technologies, and Operations

  • 1. Sameer Verma, Ph.D. Big Data Analytics Concepts, technologies, and operations Sameer Verma, Ph.D. Professor and Chair, Information Systems Lam Family College of Business San Francisco State University San Francisco, CA 94132 USA https://faculty.sfsu.edu/~sverma sverma@sfsu.edu
  • 3. University of the West Indies: Institutional Academic Partner, Centre of Excellence, Mona School of Business & Mgmt, University of the West Indies, Jamaica
  • 4. Big Data Analytics ➔ Big ➔ Data ➔ Analytics
  • 5. Big
      ● Volume
          – Size of dataset
              ● Petabytes (10^15 bytes), exabytes (10^18 bytes), zettabytes (10^21 bytes).
      ● Variety
          – Complex
              ● Structured and unstructured text, audio, video, etc.
      ● Velocity
          – Near-real-time input, processing, and output.
      ● Veracity
          – Questionable quality of input, false discovery rates...
  • 6. Sample v Population
      ● Sampling leads to inferences.
      ● We sample randomly, or in stratified modes, to work at a smaller scale.
      ● Extrapolate results to the population.
          – p-value is of utmost importance!
      ● What if we could crunch the entire population?
          – No need to sample?
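
The contrast between sampling and crunching the whole population can be sketched in a few lines of Python. This is a minimal illustration only: the population is synthetic, and the sample size of 500 and the 1.96 multiplier (a conventional 95% interval) are assumptions, not values from the slides.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 synthetic measurements (made-up data).
population = [random.gauss(50.0, 12.0) for _ in range(100_000)]

# Inferential route: draw a simple random sample and extrapolate.
sample = random.sample(population, 500)
sample_mean = statistics.mean(sample)
# Standard error of the mean, estimated from the sample alone.
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

# Descriptive route: measure the entire population directly.
population_mean = statistics.mean(population)

print(f"sample estimate: {sample_mean:.2f} +/- {1.96 * standard_error:.2f}")
print(f"population mean: {population_mean:.2f}")
```

The sample gives an estimate with an uncertainty band, while the full pass over the population gives the value itself with no inference step.
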
  • 7. Data: Nature and Structure of Data
  • 8. Normalization
      ● A process of restructuring a relational database
      ● A series of “normal forms” in order to reduce data redundancy and improve data integrity
      ● It was first proposed by Edgar F. Codd as an integral part of his relational model.
  • 9. A Bookstore Example
      ● Suggested fields for the bookstore:
          – Title
          – Author
          – Author Biography
          – ISBN
          – Price
          – Subject
          – Number of Pages
          – Publisher
          – Publisher Address
          – Description
          – Review
          – Reviewer Name
  • 10. Single Table: Multiple items
  • 11. Normalizing once: 1NF
      Reduce redundancy across columns. Make values in each column of a table atomic, i.e. no longer divisible.
      • Author
      • Bio
      • Subject
  • 12. Component tables: Author, Subject, Publisher, Book. Note: Bio can be a part of the Author table.
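
One way to picture the component tables is with Python's built-in `sqlite3` module. This is a rough sketch, not the deck's actual schema: the column names, the ISBN placeholders, the publisher, the subject, and the second book title are all hypothetical. Only the table names and the first title come from the slides.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")

# One table per component; column names are illustrative assumptions.
conn.executescript("""
CREATE TABLE author    (id INTEGER PRIMARY KEY, name TEXT, bio TEXT);
CREATE TABLE subject   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE publisher (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
CREATE TABLE book (
    isbn         TEXT PRIMARY KEY,
    title        TEXT,
    author_id    INTEGER REFERENCES author(id),
    subject_id   INTEGER REFERENCES subject(id),
    publisher_id INTEGER REFERENCES publisher(id)
);
""")

# Placeholder rows: author, subject, and publisher are each stored once.
conn.execute("INSERT INTO author VALUES (1, 'Jon Stephens', 'placeholder bio')")
conn.execute("INSERT INTO subject VALUES (1, 'Databases')")
conn.execute("INSERT INTO publisher VALUES (1, 'Example Press', '1 Example Street')")
conn.executemany(
    "INSERT INTO book VALUES (?, ?, 1, 1, 1)",
    [
        ("isbn-placeholder-1", "Beginning MySQL Database Design and Optimization"),
        ("isbn-placeholder-2", "A Second Hypothetical Title"),
    ],
)

# Joins reassemble the single-table view on demand.
rows = conn.execute("""
    SELECT book.title, author.name, subject.name, publisher.name
    FROM book
    JOIN author    ON author.id = book.author_id
    JOIN subject   ON subject.id = book.subject_id
    JOIN publisher ON publisher.id = book.publisher_id
    ORDER BY book.isbn
""").fetchall()

author_count = conn.execute("SELECT COUNT(*) FROM author").fetchone()[0]
for row in rows:
    print(row)
print("authors stored:", author_count)
```

Two books share one author row, which is the redundancy reduction that normalization aims for. The Bio column sits in the author table, as the note on the slide suggests.
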
  • 14. NoSQL
      ● Databases that require one table
      ● No SQL-like relationships
      ● Clickstream data
          – Twitter, Facebook, etc.
      ● Serialization: Reverse of Normalization
  • 15. JavaScript Object Notation
      ● JSON or JavaScript Object Notation
          { "Table1": [
              { "id": 0, "title": "Beginning MySQL Database Design and Optimization" },
              { "id": 1, "firstname": "Jon" },
              { "id": 2, "lastname": "Stephens" }
          ] }
      The same data as a table:
          Title | First Name | Last Name
          Beginning MySQL Database Design and Optimization | Jon | Stephens
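
As a small illustration, the JSON from this slide can be read back into a flat record with Python's standard `json` module. The merging step (dropping the `id` keys and combining the three objects) is an assumption about how the JSON maps onto the table row; the slides do not spell it out.

```python
import json

# The JSON document from the slide, as text.
text = """
{ "Table1": [
    { "id": 0, "title": "Beginning MySQL Database Design and Optimization" },
    { "id": 1, "firstname": "Jon" },
    { "id": 2, "lastname": "Stephens" }
] }
"""

document = json.loads(text)

# Merge the three small objects into one record, dropping the "id" keys.
record = {}
for item in document["Table1"]:
    for key, value in item.items():
        if key != "id":
            record[key] = value

print(record["title"], "|", record["firstname"], "|", record["lastname"])
```
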
  • 16. JSON and BSON
      ● JSON is for text-like data
      ● BSON is Binary JSON
          – Serialize anything as binary!
          – Store music or video as BSON.
      ● More detail: https://en.wikipedia.org/wiki/NoSQL
  • 17. Analytics
      ● Descriptive statistics
          – Frequency count, mean, variance, etc.
      ● Not inferring from sample stats.
      ● Usually applied to population.
      ● Four stages:
          – Measure, Collect, Analyze, Report.
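
The descriptive statistics named above (frequency count, mean, variance) are available in Python's standard library. The sketch below uses made-up numbers and treats them as the whole population, so it uses the population variance (`pvariance`) rather than the sample variance.

```python
from collections import Counter
import statistics

# Hypothetical measurements, treated as the entire population.
values = [2, 4, 4, 4, 5, 5, 7, 9]

frequency = Counter(values)              # frequency count per distinct value
mean = statistics.mean(values)           # arithmetic mean
variance = statistics.pvariance(values)  # population variance (divides by N)

print("frequency:", dict(frequency))
print("mean:", mean)
print("variance:", variance)
```
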
  • 18. Descriptive vs Inferential
      ● Inferential: As sampled and extrapolated. See Cook & Campbell (1979)
          – Statistical validity: Validity of correlation.
          – Internal validity: Correlation reflects a causal relationship.
          – Construct validity: Higher order constructs (independent, dependent variables).
          – External validity: Generalization across variations.
      ● Descriptive: Applies to the entire population, as measured.
  • 19. Near-real time
      ● Input is usually near-real time.
          – Automated processes.
          – System and user logs.
      ● Processing has to be near-real time.
          – Mapped and distributed.
      ● Output is expected to be near-real time.
          – Trends, associations.
  • 20. SQL vs NoSQL
      ● SQL
          – Large structured data broken into smaller atomic ones, connected by relationships.
          – Relationships are integral to the DBMS.
          – Multiple tables and keys (primary, foreign).
      ● NoSQL
          – Semi-structured and unstructured data, collapsed into strings.
          – Relationships have to be handled outside the DBMS.
          – Single table, columnar. Usually indexed.
  • 21. Column DB
      ● A columnar database is a table with one column (and one more for indexing).
      ● Collapse (serialize) multiple “fields” into one string.
      The table
          Title | First Name | Last Name
          Beginning MySQL Database Design and Optimization | Jon | Stephens
      becomes
          {"Table1": [{"id": 0,"title": "Beginning MySQL Database Design and Optimization"},{"id": 1,"firstname": "Jon"},{"id": 2,"lastname": "Stephens"}]}
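
The "one index column plus one value column" idea can be mimicked with a plain Python dictionary. This is a toy stand-in, not how HBase or any real store works internally; the key name `book:0` is hypothetical.

```python
import json

# A row with several "fields", as it would appear in a relational table.
row = {
    "title": "Beginning MySQL Database Design and Optimization",
    "firstname": "Jon",
    "lastname": "Stephens",
}

# Toy key-value store: the dict key plays the index column,
# the serialized string plays the single value column.
store = {}
store["book:0"] = json.dumps(row, separators=(",", ":"))

# Reading the row back means deserializing the string.
restored = json.loads(store["book:0"])

print(store["book:0"])
print(restored["lastname"])
```

Any relationship between rows (for example, two books by the same author) now has to be resolved by the application, which matches the point that relationships are handled outside the DBMS.
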
  • 22. HBase
      ● Apache HBase.
          – https://en.wikipedia.org/wiki/Apache_HBase
          – Column-oriented
          – Key-value datastore
      Source: https://www.slideshare.net/hortonworks/integration-of-hive-and-hbase-12805463
  • 23. Distributed crunching
      ● Distributed
          – Hadoop Distributed File System (HDFS)
      ● ZooKeeper
      Source: https://www.slideshare.net/hortonworks/integration-of-hive-and-hbase-12805463
  • 24. MapReduce
      ● MapReduce
          – Maps data into smaller components
          – Reduces or distills the output from each computational node.
      ● Runs in unison and continuously.
      ● Distributes the load across multiple cloud machines.
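
A single-process word count is the usual way to illustrate the map and reduce steps. The sketch below runs on one machine and only imitates the flow; the function names and the two input chunks are made up, and a real Hadoop job would distribute the chunks across nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the mapped values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: distill all values for one key into a single count."""
    return key, sum(values)


# Each chunk stands in for the data handled by one node.
chunks = ["big data analytics", "big data big clusters"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())

print(counts)
```
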
  • 26. Cloud Computing
      ● Moore’s law
          – Cost and size being constant, computing crunch doubles every 18 to 24 months.
      ● Metcalfe’s law
          – Utility of a network is proportional to the square of the number of connected computers.
      ● Both observations describe rapid, nonlinear growth (exponential for Moore’s law, quadratic for Metcalfe’s law).
      ● Cloud computing is the confluence of both.
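
The two laws can be written as simple functions to compare their growth. This is a back-of-the-envelope sketch: the 2-year doubling period is one end of the 18 to 24 month range on the slide, the proportionality constant in Metcalfe's law is taken as 1, and the function names are made up.

```python
def moore_factor(years, doubling_period_years=2.0):
    """Growth factor in computing capacity after `years`, doubling each period."""
    return 2 ** (years / doubling_period_years)


def metcalfe_utility(nodes):
    """Network utility, taken as the square of the number of connected nodes."""
    return nodes ** 2


# Ten years at a 2-year doubling period is five doublings.
print("capacity growth over 10 years:", moore_factor(10))

# Doubling the number of nodes quadruples the utility.
print("utility at 10 nodes:", metcalfe_utility(10))
print("utility at 20 nodes:", metcalfe_utility(20))
```
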
  • 27. Cloud Computing
      ● Infrastructure as a Service (IaaS).
          – Amazon, Azure, Google, OpenStack...
      ● Utility-oriented.
      ● Pay-as-you-go.
      ● Challenges: provisioning and scaling of a given architecture.
  • 28. Orchestration
      ● Orchestration
          – Streamlined provisioning and scaling
          – Distilled ops
          – Abstracted away from cloud vendors
      ● API
      ● Provision on any cloud platform.
      ● AWS, Azure, Google, OpenStack...
  • 29. Ubuntu Juju
      ● Juju
          – Canonical: Makers of Ubuntu.
          – Open Source
          – Application and Service modeling tool
          – Deploy, Manage and Scale on any cloud
          – Charms - https://charmhub.io
  • 30. Juju + charms (diagram labels: Juju, http://Charmhub.io, API, LXC)
  • 31. Hadoop HBase via Juju
      ● Hadoop HBase “charm”
          – Fourteen-unit big data cluster
          – A distributed big data store with MapReduce
          – Runs on 8 machines in your cloud.
  • 32. Hadoop HBase Architecture
  • 33. Provisioning on any cloud
      juju deploy hbase
      https://charmhub.io/hbase
  • 34. Containers vs VM
      ● A virtual machine includes a kernel
      ● Containers logically replicate all that is the same across installs.
          – Share kernel
          – Account, resource and file system isolation
      ● BSD jails, chroot, Docker, LXC.
  • 35. LXC as local cloud
      ● LXC can run on a laptop
      ● LXD to manage LXC containers
          – https://charmhub.io/lxd
          – juju deploy lxd
  • 36. Kubernetes
      ● Container orchestration system (via Google)
      ● Containers can be a mix and match of VMs, Docker, etc.
      ● https://en.wikipedia.org/wiki/Kubernetes
  • 37. Micro Kubernetes
      ● A micro installation of Kubernetes
          – MicroK8s (aka microkates)
          – https://microk8s.io/
      ● Run on your dev machine
          – sudo snap install microk8s --classic
      ● Run on Raspberry Pi
      ● “Edge” device
  • 38. Conclusion
      ● Population data (Volume)
      ● Unstructured data (Variety)
      ● Near-real time (Velocity)
      ● Descriptive stats (Veracity)
      ● Cloud Computing = crunch + network