Big Data Analytics: Concepts, Technologies, and Operations

Sameer Verma, Ph.D.
Professor and Chair, Information Systems
Lam Family College of Business
San Francisco State University
San Francisco, CA 94132 USA
https://faculty.sfsu.edu/~sverma
sverma@sfsu.edu

University of the West Indies
Institutional Academic Partner
Centre of Excellence
Mona School of Business & Management
University of the West Indies
Jamaica

Big Data Analytics
➔ Big
➔ Data
➔ Analytics

Big
● Volume
– Size of dataset: petabytes (10^15 bytes), exabytes (10^18 bytes), zettabytes (10^21 bytes).
● Variety
– Complex: structured and unstructured text, audio, video, etc.
● Velocity
– Near-real-time input, processing, and output.
● Veracity
– Questionable quality of input, false discovery rates...

Sample vs Population
● Sampling leads to inferences.
● We sample randomly, or in stratified modes, to work at a smaller scale.
● Extrapolate results to the population.
– p-value is of utmost importance!
● What if we could crunch the entire population?
– No need to sample?
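
A minimal sketch of the contrast above, assuming a synthetic dataset (the "order values" and all numbers here are made up for illustration). The sample gives an estimate with a margin of error; crunching every record gives the population value directly.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical "population": 100,000 synthetic order values.
population = [random.gauss(50.0, 10.0) for _ in range(100_000)]

# Classic approach: draw a simple random sample and infer the population mean.
sample = random.sample(population, 500)
sample_mean = statistics.mean(sample)
std_error = statistics.stdev(sample) / len(sample) ** 0.5

# Big data approach: compute the value directly from every record.
population_mean = statistics.mean(population)

print(f"sample estimate : {sample_mean:.2f} +/- {1.96 * std_error:.2f}")
print(f"population value: {population_mean:.2f}")
```

The sample result carries uncertainty that has to be reasoned about; the population result is simply a measurement.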

Data
Nature and Structure of Data

Normalization
● A process of restructuring a relational database
● Applies a series of “normal forms” to reduce data redundancy and improve data integrity
● First proposed by Edgar F. Codd as an integral part of his relational model.

A Bookstore Example
● Suggested fields for the bookstore:
– Title
– Author
– Author Biography
– ISBN
– Price
– Subject
– Number of Pages
– Publisher
– Publisher Address
– Description
– Review
– Reviewer Name

Single Table
Multiple items

Normalizing once: 1NF
Reduce redundancy across columns. Make values in each column of a table atomic, i.e., no longer divisible.
● Author
● Bio
● Subject

Component tables
Author
Subject
Publisher
Book
Note: Bio can be a part of the Author table

Relationships
Author
Subject
Publisher
Book
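
The component tables and their relationships could be sketched with Python's built-in sqlite3 module as follows. This is one possible layout, not the deck's actual schema: the column names, the one-author-per-book simplification, and the placeholder subject, publisher, and ISBN values are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.executescript("""
CREATE TABLE author    (author_id INTEGER PRIMARY KEY, name TEXT NOT NULL, bio TEXT);
CREATE TABLE subject   (subject_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE publisher (publisher_id INTEGER PRIMARY KEY, name TEXT NOT NULL, address TEXT);
CREATE TABLE book (
    isbn         TEXT PRIMARY KEY,
    title        TEXT NOT NULL,
    price        REAL,
    pages        INTEGER,
    author_id    INTEGER NOT NULL REFERENCES author(author_id),
    subject_id   INTEGER NOT NULL REFERENCES subject(subject_id),
    publisher_id INTEGER NOT NULL REFERENCES publisher(publisher_id)
);
""")

# Placeholder rows: only the title and author name come from the slides.
conn.execute("INSERT INTO author (author_id, name) VALUES (1, 'Jon Stephens')")
conn.execute("INSERT INTO subject (subject_id, name) VALUES (1, 'Databases')")
conn.execute("INSERT INTO publisher (publisher_id, name) VALUES (1, 'Example Press')")
conn.execute(
    "INSERT INTO book (isbn, title, author_id, subject_id, publisher_id) "
    "VALUES (?, ?, 1, 1, 1)",
    ("ISBN-PLACEHOLDER", "Beginning MySQL Database Design and Optimization"),
)

# The relationships let a join reassemble the original single-table view.
row = conn.execute("""
    SELECT b.title, a.name, s.name, p.name
    FROM book b
    JOIN author a    ON a.author_id = b.author_id
    JOIN subject s   ON s.subject_id = b.subject_id
    JOIN publisher p ON p.publisher_id = b.publisher_id
""").fetchone()
print(row)
```

Each author, subject, and publisher is stored once; books point to them through foreign keys.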

NoSQL
● Databases that require one table
● No SQL-like relationships
● Clickstream data
– Twitter, Facebook, etc.
● Serialization: Reverse of Normalization

JavaScript Object Notation
● JSON or JavaScript Object Notation
{
  "Table1": [
    {
      "id": 0,
      "title": "Beginning MySQL Database Design and Optimization"
    },
    {
      "id": 1,
      "firstname": "Jon"
    },
    {
      "id": 2,
      "lastname": "Stephens"
    }
  ]
}
Title | First Name | Last Name
Beginning MySQL Database Design and Optimization | Jon | Stephens
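
As a rough illustration, the JSON document above can be read back into the tabular row with Python's standard json module. The flattening loop is just one way to do it; the variable names are illustrative.

```python
import json

doc = """
{
  "Table1": [
    {"id": 0, "title": "Beginning MySQL Database Design and Optimization"},
    {"id": 1, "firstname": "Jon"},
    {"id": 2, "lastname": "Stephens"}
  ]
}
"""

data = json.loads(doc)  # text -> nested dicts and lists

# Merge the per-field objects into one row, dropping the positional "id".
row = {}
for item in data["Table1"]:
    for key, value in item.items():
        if key != "id":
            row[key] = value

print(row["title"], "|", row["firstname"], "|", row["lastname"])
```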

JSON and BSON
● JSON is for text-like data
● BSON is Binary JSON
– Serialize anything as binary!
– Store music or video as BSON.
● More detail: https://en.wikipedia.org/wiki/NoSQL

Analytics
● Descriptive statistics
– Frequency count, mean, variance, etc.
● Not inferring from sample stats.
● Usually applied to population.
● Four stages:
– Measure, Collect, Analyze, Report.
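
A small sketch of the descriptive statistics named above, using only the Python standard library. The click counts are made-up values standing in for a fully measured population, which is why the population variance (`pvariance`) is used rather than the sample variance.

```python
from collections import Counter
import statistics

# Hypothetical population: clicks per user for every user in the log.
clicks = [3, 5, 5, 2, 8, 5, 3, 1]

frequency = Counter(clicks)               # frequency count
mean = statistics.mean(clicks)            # mean
variance = statistics.pvariance(clicks)   # population variance, no inference involved

print("frequency:", dict(frequency))
print("mean:", mean)
print("variance:", variance)
```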

Descriptive vs Inferential
● Inferential: As sampled and extrapolated. See Cook & Campbell (1979)
– Statistical validity: Validity of correlation.
– Internal validity: Correlation reflects a causal relationship.
– Construct validity: Higher order constructs (independent, dependent variables).
– External validity: Generalization across variations.
● Descriptive: Applies to the entire population, as measured.

Near-real time
● Input is usually near-real time.
– Automated processes.
– System and user logs.
● Processing has to be near-real time.
– Mapped and distributed.
● Output is expected to be near-real time.
– Trends, associations.

SQL vs NoSQL
● SQL
– Large structured data broken into smaller atomic ones, connected by relationships.
– Relationships are integral to the DBMS.
– Multiple tables and keys (primary, foreign).
● NoSQL
– Semi-structured and unstructured data, collapsed into strings.
– Relationships have to be handled outside the DBMS.
– Single table, columnar. Usually indexed.

Column DB
● A columnar database is a table with one column (and one more for indexing).
● Collapse (serialize) multiple “fields” into one string.
Title | First Name | Last Name
Beginning MySQL Database Design and Optimization | Jon | Stephens
becomes
{"Table1": [{"id": 0,"title": "Beginning MySQL Database Design and Optimization"},{"id": 1,"firstname": "Jon"},{"id": 2,"lastname": "Stephens"}]}
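
The "one column plus an index" idea can be sketched as a two-column key-value table. This uses sqlite3 purely as a stand-in store; the table name `store`, the key `book:0`, and the flat record layout are assumptions for illustration, not how any particular NoSQL product works internally.

```python
import json
import sqlite3

# The row from the bookstore example, held as separate fields.
record = {
    "title": "Beginning MySQL Database Design and Optimization",
    "firstname": "Jon",
    "lastname": "Stephens",
}

# Serialize: collapse all fields into a single string.
value = json.dumps(record, sort_keys=True)

# One value column, plus a key column that serves as the index.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE store (key TEXT PRIMARY KEY, value TEXT NOT NULL)")
conn.execute("INSERT INTO store (key, value) VALUES (?, ?)", ("book:0", value))

# Reading it back: look up by key, then deserialize outside the database.
(stored,) = conn.execute(
    "SELECT value FROM store WHERE key = ?", ("book:0",)
).fetchone()
restored = json.loads(stored)
print(restored["firstname"], restored["lastname"])
```

The database only sees an opaque string; any relationship between fields is handled by the application.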

HBase
● Apache HBase.
– https://en.wikipedia.org/wiki/Apache_HBase
– Column-oriented
– Key-value datastore
Source: https://www.slideshare.net/hortonworks/integration-of-hive-and-hbase-12805463

Distributed crunching
● Distributed
– Hadoop Distributed File System (HDFS)
● ZooKeeper
Source: https://www.slideshare.net/hortonworks/integration-of-hive-and-hbase-12805463

MapReduce
● MapReduce
– Maps data into smaller components
– Reduces or distills the output from each computational node.
● Runs in parallel and continuously.
● Distributes the load across multiple cloud machines.
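
A toy, single-process sketch of the map and reduce steps described above. It is not Hadoop code: the log lines are invented, and each "chunk" stands in for the portion of data that a separate node would process.

```python
from collections import Counter
from functools import reduce

def map_chunk(lines):
    """Map step: turn one chunk of log lines into partial token counts."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def reduce_counts(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right

# Hypothetical clickstream log, split into chunks as if spread across nodes.
logs = ["GET /home", "GET /cart", "POST /cart", "GET /home"]
chunks = [logs[:2], logs[2:]]

partials = [map_chunk(chunk) for chunk in chunks]    # one result per "node"
totals = reduce(reduce_counts, partials, Counter())  # distilled output

print(totals.most_common())
```

In a real cluster the map calls run on different machines at the same time, and only the small partial results travel over the network to be reduced.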

Cloud Computing

Cloud Computing
● Moore’s law
– Cost and size being constant, computing crunch doubles every 18 to 24 months.
● Metcalfe’s law
– Utility of a network is proportional to the square of the number of connected computers.
● Both observations describe non-linear growth: exponential over time for Moore’s law, quadratic in network size for Metcalfe’s law.
● Cloud computing is the confluence of both.
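
The two laws can be put into numbers with a short sketch. The function names and the six-year horizon are illustrative choices; the proportionality constant in Metcalfe's law is taken as 1 for simplicity.

```python
def moore_factor(months, doubling_period_months):
    """Growth in computing capacity after `months`, doubling every period."""
    return 2 ** (months / doubling_period_months)

def metcalfe_utility(nodes):
    """Relative network utility: proportional to the square of connected nodes."""
    return nodes ** 2

# Over six years (72 months), for the two ends of the 18-to-24-month range:
print(moore_factor(72, 18))   # four doublings
print(moore_factor(72, 24))   # three doublings

# Doubling the network size quadruples its utility:
print(metcalfe_utility(20) / metcalfe_utility(10))
```

Moore's law compounds over time, while Metcalfe's law grows with the square of network size; cloud computing benefits from both at once.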

Cloud Computing
● Infrastructure as a Service (IaaS).
– Amazon, Azure, Google, OpenStack...
● Utility-oriented.
● Pay-as-you-go.
● Challenges: provisioning and scaling of a given architecture.

Orchestration
● Orchestration
– Streamlined provisioning and scaling
– Distilled ops
– Abstracted away from cloud vendors
● API
● Provision on any cloud platform.
● AWS, Azure, Google, OpenStack...

Ubuntu Juju
● Juju
– Canonical: Makers of Ubuntu.
– Open Source
– Application and Service modeling tool
– Deploy, Manage and Scale on any cloud
– Charms - https://charmhub.io

Juju + charms
Juju
http://Charmhub.io
API
LXC

Hadoop HBase via Juju
● Hadoop HBase “charm”
– Fourteen-unit big data cluster
– A distributed big data store with MapReduce
– Runs on 8 machines in your cloud.

Hadoop HBase Architecture

Provisioning on any cloud
juju deploy hbase
https://charmhub.io/hbase

Containers vs VM
● A virtual machine includes its own kernel.
● Containers share everything that is the same across installs instead of duplicating it.
– Share kernel
– Account, resource and file system isolation
● BSD jails, chroot, Docker, LXC.

LXC as local cloud
● LXC can run on a laptop
● LXD to manage LXC containers
– https://charmhub.io/lxd
– juju deploy lxd

Kubernetes
● Container orchestration system (via Google)
● Containers can be a mix & match of VMs, Docker, etc.
● https://en.wikipedia.org/wiki/Kubernetes

Micro Kubernetes
● A micro installation of Kubernetes
– MicroK8s (aka microkates)
– https://microk8s.io/
● Run on your dev machine
– snap install microk8s
● Run on Raspberry Pi
● “Edge” device

Conclusion
● Population data (Volume)
● Unstructured data (Variety)
● Near-real time (Velocity)
● Descriptive stats (Veracity)
● Cloud Computing = crunch + network
