Sameer Verma, Ph.D.
Big Data Analytics
Concepts, technologies, and operations
Sameer Verma, Ph.D.
Professor and Chair, Information Systems
Lam Family College of Business
San Francisco State University
San Francisco, CA 94132 USA
https://faculty.sfsu.edu/~sverma
sverma@sfsu.edu

University of the West Indies
Institutional Academic Partner, Centre of Excellence
Mona School of Business & Management
University of the West Indies, Jamaica

Big Data Analytics
➔ Big
➔ Data
➔ Analytics

Big
● Volume
– Size of dataset: petabytes (10^15 bytes), exabytes (10^18 bytes), zettabytes (10^21 bytes).
● Variety
– Complex: structured and unstructured text, audio, video, etc.
● Velocity
– Near-real-time input, processing, and output.
● Veracity
– Questionable quality of input, false discovery rates...

Sample vs. Population
● Sampling leads to inferences.
● We sample randomly, or in stratified modes, to work at a smaller scale.
● Extrapolate results to the population.
– The p-value is of utmost importance!
● What if we could crunch the entire population?
– No need to sample?
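
The contrast above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "population" is synthetic data generated on the spot, and the names `population`, `sample`, and `standard_error` are hypothetical. It shows an estimate inferred from a sample next to the value measured on the whole population.

```python
import random
import statistics

# Hypothetical population: 100,000 synthetic values (illustrative data only).
rng = random.Random(42)
population = [rng.gauss(50.0, 10.0) for _ in range(100_000)]

# Inferential route: draw a simple random sample and estimate the mean from it.
sample = rng.sample(population, 1_000)
sample_mean = statistics.mean(sample)
# Standard error of the mean, computed from the sample alone.
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

# Descriptive route: measure the whole population directly, no inference needed.
population_mean = statistics.mean(population)

print(f"sample estimate:  {sample_mean:.2f} +/- {1.96 * standard_error:.2f}")
print(f"population value: {population_mean:.2f}")
```

The sample route carries an uncertainty term; the population route does not, which is the point the last bullet raises.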

Data
Nature and Structure of Data

Normalization
● A process of restructuring a relational database
● Applies a series of “normal forms” to reduce data redundancy and improve data integrity
● First proposed by Edgar F. Codd as an integral part of his relational model.

A Bookstore Example
● Suggested fields for the bookstore:
– Title
– Author
– Author Biography
– ISBN
– Price
– Subject
– Number of Pages
– Publisher
– Publisher Address
– Description
– Review
– Reviewer Name

Single Table
Multiple items

Normalizing once: 1NF
Reduce redundancy across columns. Make the values in each column of a table atomic, i.e. no longer divisible.
• Author
• Bio
• Subject

Component tables
Author, Subject, Publisher, Book
Note: Bio can be a part of the Author table.

Relationships
Author, Subject, Publisher, Book
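
One way to picture the component tables and their relationships is a small in-memory SQLite schema. This is a minimal sketch, not the deck's actual design: the column names, the one-author-per-book simplification, and every value except the title and author name are hypothetical placeholders.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE author (
    author_id  INTEGER PRIMARY KEY,
    first_name TEXT,
    last_name  TEXT,
    bio        TEXT            -- Bio kept in the Author table
);
CREATE TABLE publisher (
    publisher_id INTEGER PRIMARY KEY,
    name         TEXT,
    address      TEXT
);
CREATE TABLE subject (
    subject_id INTEGER PRIMARY KEY,
    name       TEXT
);
CREATE TABLE book (
    isbn         TEXT PRIMARY KEY,
    title        TEXT,
    price        REAL,
    author_id    INTEGER REFERENCES author(author_id),
    publisher_id INTEGER REFERENCES publisher(publisher_id),
    subject_id   INTEGER REFERENCES subject(subject_id)
);
""")

# Placeholder rows: only the title and author name come from the slides.
db.execute("INSERT INTO author VALUES (1, 'Jon', 'Stephens', 'placeholder bio')")
db.execute("INSERT INTO publisher VALUES (1, 'Example Press', 'placeholder address')")
db.execute("INSERT INTO subject VALUES (1, 'Databases')")
db.execute(
    "INSERT INTO book VALUES (?, ?, ?, ?, ?, ?)",
    ("placeholder-isbn", "Beginning MySQL Database Design and Optimization", 0.0, 1, 1, 1),
)

# The relationships (foreign keys) let a join reassemble the original wide row.
rows = db.execute("""
    SELECT book.title, author.first_name, author.last_name, publisher.name, subject.name
    FROM book
    JOIN author    ON author.author_id = book.author_id
    JOIN publisher ON publisher.publisher_id = book.publisher_id
    JOIN subject   ON subject.subject_id = book.subject_id
""").fetchall()
print(rows)
```

A book with several authors or subjects would need junction tables; that is left out here to keep the sketch short.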

NoSQL
● Databases that do not require a multi-table relational schema; a single table can suffice
● No SQL-like relationships between tables
● Clickstream data
– Twitter, Facebook, etc.
● Serialization: the reverse of normalization

JavaScript Object Notation
● JSON, or JavaScript Object Notation

{
  "Table1": [
    {
      "id": 0,
      "title": "Beginning MySQL Database Design and Optimization"
    },
    {
      "id": 1,
      "firstname": "Jon"
    },
    {
      "id": 2,
      "lastname": "Stephens"
    }
  ]
}

Title | First Name | Last Name
Beginning MySQL Database Design and Optimization | Jon | Stephens
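
As a rough illustration, the JSON document from this slide can be parsed with Python's standard `json` module and flattened back into the single table row. The variable names (`slide_json`, `row`) are hypothetical, and dropping the `id` keys is an assumption made only to recover the three table columns.

```python
import json

# The JSON document from the slide, as a string.
slide_json = """
{
  "Table1": [
    {"id": 0, "title": "Beginning MySQL Database Design and Optimization"},
    {"id": 1, "firstname": "Jon"},
    {"id": 2, "lastname": "Stephens"}
  ]
}
"""

# Deserialize: text -> Python dicts and lists.
document = json.loads(slide_json)

# Merge the fragments into one flat row, dropping the positional "id" keys.
row = {}
for fragment in document["Table1"]:
    for key, value in fragment.items():
        if key != "id":
            row[key] = value

# Serialize again: Python dict -> one text string.
serialized = json.dumps(row)
print(row)
print(serialized)
```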

JSON and BSON
● JSON is for text-like data
● BSON is Binary JSON
– Serialize anything as binary!
– Store music or video as BSON.
● More detail: https://en.wikipedia.org/wiki/NoSQL

Analytics
● Descriptive statistics
– Frequency count, mean, variance, etc.
● Not inferring from sample statistics.
● Usually applied to the population.
● Four stages:
– Measure, Collect, Analyze, Report.
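
A minimal sketch of the descriptive statistics named above, assuming a tiny made-up dataset that stands in for a fully measured population (the names `page_views` and `response_ms` are hypothetical). Because the data is treated as the whole population, the population variance (`pvariance`) is used rather than the sample variance.

```python
from collections import Counter
import statistics

# Hypothetical "population": every page view recorded in one period (made-up data).
page_views = ["home", "cart", "home", "search", "home", "cart"]
response_ms = [120, 340, 95, 210, 130, 305]

# Frequency count of a categorical field.
frequency = Counter(page_views)

# Mean and population variance of a numeric field.
mean_ms = statistics.mean(response_ms)
variance_ms = statistics.pvariance(response_ms)

print(frequency.most_common())
print(mean_ms, round(variance_ms, 2))
```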

Descriptive vs Inferential
● Inferential: As sampled and extrapolated. See Cook & Campbell (1979).
– Statistical conclusion validity: validity of the inferred correlation.
– Internal validity: the correlation reflects a causal relationship.
– Construct validity: higher-order constructs (independent, dependent variables).
– External validity: generalization across variations.
● Descriptive: Applies to the entire population, as measured.

Near-real time
● Input is usually near-real time.
– Automated processes.
– System and user logs.
● Processing has to be near-real time.
– Mapped and distributed.
● Output is expected to be near-real time.
– Trends, associations.
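
One common way to report trends from a continuous input stream is a sliding time window. The sketch below is an assumption for illustration, not a technique named in the slides: the class `SlidingWindowCounter`, its methods, and the event names are all hypothetical.

```python
from collections import Counter, deque


class SlidingWindowCounter:
    """Counts events seen during the most recent `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, key) pairs, oldest first
        self.counts = Counter()  # live count per key inside the window

    def add(self, timestamp, key):
        """Record one event, then drop anything that fell out of the window."""
        self.events.append((timestamp, key))
        self.counts[key] += 1
        self._expire(timestamp)

    def _expire(self, now):
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def top(self, n=1):
        """Return the n most frequent keys currently in the window."""
        return self.counts.most_common(n)


# Made-up log events: (seconds since start, event name).
counter = SlidingWindowCounter(window_seconds=60)
for timestamp, key in [(0, "login"), (30, "checkout"), (45, "checkout"), (90, "checkout")]:
    counter.add(timestamp, key)

print(counter.top(1))
```

Each new event updates the output immediately, so input, processing, and output all stay close to real time.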

SQL vs NoSQL
● SQL
– Large structured data broken into smaller atomic tables, connected by relationships.
– Relationships are integral to the DBMS.
– Multiple tables and keys (primary, foreign).
● NoSQL
– Semi-structured and unstructured data, collapsed into strings.
– Relationships have to be handled outside the DBMS.
– Single table, columnar. Usually indexed.

Column DB
● In this simplified view, a columnar database is a table with one column (and one more for indexing).
● Collapse (serialize) multiple “fields” into one string.

Title | First Name | Last Name
Beginning MySQL Database Design and Optimization | Jon | Stephens

becomes

{"Table1": [{"id": 0,"title": "Beginning MySQL Database Design and Optimization"},{"id": 1,"firstname": "Jon"},{"id": 2,"lastname": "Stephens"}]}
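
The "one value column plus an index" idea can be sketched with a two-column SQLite table that stores each record as a serialized JSON string. This is only an illustration of the simplified model on this slide, not how any particular NoSQL product is implemented; the table name `store`, the helpers `put` and `get`, and the row key are hypothetical.

```python
import json
import sqlite3

# One table: an indexed key column and a single value column.
kv = sqlite3.connect(":memory:")
kv.execute("CREATE TABLE store (row_key TEXT PRIMARY KEY, value TEXT)")


def put(row_key, fields):
    """Collapse (serialize) all fields into one string and store it under the key."""
    kv.execute(
        "INSERT OR REPLACE INTO store (row_key, value) VALUES (?, ?)",
        (row_key, json.dumps(fields)),
    )


def get(row_key):
    """Look up the key and expand the string back into fields, or return None."""
    found = kv.execute(
        "SELECT value FROM store WHERE row_key = ?", (row_key,)
    ).fetchone()
    return json.loads(found[0]) if found else None


put("book:0", {
    "title": "Beginning MySQL Database Design and Optimization",
    "firstname": "Jon",
    "lastname": "Stephens",
})
print(get("book:0"))
```

Any relationship between records (for example, all books by one author) has to be resolved by the application, since the database only sees opaque strings.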

HBase
● Apache HBase
– https://en.wikipedia.org/wiki/Apache_HBase
– Column-oriented
– Key-value datastore
Source: https://www.slideshare.net/hortonworks/integration-of-hive-and-hbase-12805463

Distributed crunching
● Distributed
– Hadoop Distributed File System (HDFS)
● ZooKeeper
Source: https://www.slideshare.net/hortonworks/integration-of-hive-and-hbase-12805463

MapReduce
● MapReduce
– Maps data into smaller components.
– Reduces, or distills, the output from each computational node.
● Runs in parallel and continuously.
● Distributes the load across multiple cloud machines.
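
The map and reduce steps can be illustrated with a single-process word count. This is a toy sketch of the pattern, not Hadoop's API: the function names and the input chunks are hypothetical, and in a real cluster each chunk would be mapped on a different machine.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: break one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all values emitted for the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: distill each key's values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for the data held by one node (made-up input).
chunks = ["big data big", "data analytics", "big analytics"]

mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because `map_phase` looks at one chunk at a time, the map calls are independent of each other, which is what lets the load be spread across machines.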

Cloud Computing

Cloud Computing
● Moore’s law
– Cost and size being constant, computing crunch (processing power) doubles every 18 to 24 months.
● Metcalfe’s law
– The utility of a network is proportional to the square of the number of connected computers.
● Both observations describe non-linear growth: Moore’s law is exponential, Metcalfe’s law is quadratic.
● Cloud computing is the confluence of both.
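
The two laws reduce to simple arithmetic, sketched below. The function names, the 24-month default doubling period, and the unit-less "capacity" and "utility" values are assumptions chosen for illustration.

```python
def moore_capacity(years, doubling_period_years=2.0, base=1.0):
    """Relative computing capacity after `years`, doubling every period."""
    return base * 2 ** (years / doubling_period_years)


def metcalfe_utility(nodes):
    """Relative network utility, proportional to the square of the node count."""
    return nodes ** 2


# Ten years at a 24-month doubling period is five doublings.
print(moore_capacity(10))

# Doubling the number of connected computers quadruples the utility.
print(metcalfe_utility(10), metcalfe_utility(20))
```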

Cloud Computing
● Infrastructure as a Service (IaaS).
– Amazon, Azure, Google, OpenStack...
● Utility-oriented.
● Pay-as-you-go.
● Challenges: provisioning and scaling of a given architecture.

Orchestration
● Orchestration
– Streamlined provisioning and scaling
– Distilled ops
– Abstracted away from cloud vendors
● API
● Provision on any cloud platform.
● AWS, Azure, Google, OpenStack...

Ubuntu Juju
● Juju
– From Canonical, the makers of Ubuntu.
– Open source
– Application and service modeling tool
– Deploy, manage, and scale on any cloud
– Charms: https://charmhub.io

Juju + charms
[Diagram: Juju, https://charmhub.io, API, LXC]

Hadoop HBase via Juju
● Hadoop HBase “charm”
– Fourteen-unit big data cluster
– A distributed big data store with MapReduce
– Runs on 8 machines in your cloud.

Hadoop HBase Architecture

Provisioning on any cloud
juju deploy hbase
https://charmhub.io/hbase

Containers vs VM
● A virtual machine includes its own kernel.
● Containers share what is common across installs rather than physically duplicating it.
– Share the host kernel
– Account, resource, and file system isolation
● Examples: BSD jails, chroot, Docker, LXC.

LXC as local cloud
● LXC can run on a laptop
● LXD manages LXC containers
– https://charmhub.io/lxd
– juju deploy lxd

Kubernetes
● Container orchestration system (originally from Google)
● Containers can be a mix and match of VMs, Docker, etc.
● https://en.wikipedia.org/wiki/Kubernetes

Micro Kubernetes
● A micro installation of Kubernetes
– MicroK8s (aka “micro-kates”)
– https://microk8s.io/
● Run on your dev machine
– snap install microk8s
● Run on Raspberry Pi
● “Edge” device

Conclusion
● Population data (Volume)
● Unstructured data (Variety)
● Near-real time (Velocity)
● Descriptive stats (Veracity)
● Cloud Computing = crunch + network
