Big data Intro by Kaushik Dutta

Data Management for Analytics
Kaushik Dutta
Information Systems and Decision Science
Muma College of Business
University of South Florida

Big Data
http://hortonworks.com/wp-content/uploads/2012/05/bigdata_diagram.png

Data Driven Applications/Analytics
• Excel
• Database
[Diagram: boxes labeled Database, Applications / Analytics, and Analytics]

Traditional Database
• Relational Database – MySQL, Oracle.
• Issues with Relational database
– Weak clustering technology
– Does not scale horizontally
• Adding one more node to a single-instance MySQL database server does not
double the performance.
– Strict data format – suitable for structured data only
• Why?
– Strict ACID rules in Relational Database
• Atomicity, Consistency, Isolation and Durability
• Due to ACID rules, data must be synchronized across all nodes in the cluster
before a transaction is completed
– This adds overhead to the database system, making linear scaling very hard to achieve
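To make the "A" in ACID concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the names, and the transfer helper are hypothetical and exist only for illustration; the point is that a failed transaction leaves no partial update behind.

```python
import sqlite3

# Hypothetical accounts table, used only to illustrate atomicity.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates apply or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            # This debit violates the CHECK constraint when funds are short,
            # which also undoes the credit above.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

A transfer that exceeds the source balance returns False and leaves both rows unchanged. On a cluster, this all-or-nothing guarantee has to hold across nodes, which is where the synchronization overhead comes from.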

Data Types
• Structured
• Semi-structured
• Unstructured -> Structured

Big Data Storage
• No-SQL Database
• Distributed File Systems

No-SQL Database
• Relaxed ACID property
• Distributed across multiple nodes
• Scaling is more important than perfect synchronization
• Semi-strict data format – suitable for unstructured data
• ACID vs. BASE
– ACID: Atomicity, Consistency, Isolation and Durability
– BASE: Basically available, soft state, eventually consistent
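The "eventually consistent" part of BASE can be sketched as a toy in-memory store. The class names and the explicit sync() call are assumptions made for illustration; real systems propagate updates in the background with their own protocols.

```python
class Replica:
    """One node's copy of the data (a plain dict in this sketch)."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy store: a write is acknowledged after reaching one replica."""

    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = []  # updates not yet propagated to the other replicas

    def write(self, key, value):
        # Only the first replica is updated before the write returns.
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        # A read may hit any replica, so it can return stale data.
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        # Stand-in for background replication: apply pending updates in order.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica.data[key] = value
        self.pending.clear()
```

Between write() and sync(), a read from another replica returns the old value (or nothing). After sync(), all replicas agree. That window of staleness is the price paid for not synchronizing every node before acknowledging a write.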

No-SQL Database
• Key Value stores
• Document Databases
• Wide-Column (or column family) stores
• Columnar Database

Key-Value Stores
• Distributed hash-table
– Key – search based on key, alpha-numeric
– Value – text, lists, set or complex objects
– Example
• Redis (http://redis.io/)
• Voldemort (LinkedIn)
• Berkeley DB
• Riak
• DynamoDB from Amazon
– Usage
• User profiles
• Session data
• Product information
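The distributed hash-table idea can be sketched in a few lines: hash the key, pick a node, and store the value there. This is a simplified illustration, not how Redis or DynamoDB work internally; the KeyValueCluster class and the modulo placement are assumptions.

```python
import hashlib


class KeyValueCluster:
    """Toy distributed hash table: each node is an in-memory dict."""

    def __init__(self, n_nodes=4):
        self.nodes = [{} for _ in range(n_nodes)]

    def _node_for(self, key):
        # A stable hash of the key decides which node owns it.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.nodes)

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key, default=None):
        # Lookup is by key only; values are opaque to the store.
        return self.nodes[self._node_for(key)].get(key, default)


cluster = KeyValueCluster()
cluster.put("session:42", {"user": "ana", "cart": ["book"]})
cluster.put("profile:ana", {"name": "Ana", "city": "Tampa"})
```

Because the store only ever looks at the key, a lookup touches exactly one node. That suits session data and user profiles, where the key is always known in advance.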

Document Database
• Both keys and values are searchable
• Value – semi-structured data – (name, value) pairs
• Value columns may vary from row to row
– Different rows may have a different number and type of attributes
• Typical value – JSON, XML, BSON (Binary JSON)
• Example
– CouchDB (JSON)
• http://couchdb.apache.org/
– MongoDB (BSON)
• https://www.mongodb.org/
• Usage – storing and managing text documents,
email messages, XML documents
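As a rough illustration of the document model, the sketch below keeps JSON-like documents in a dict and searches on value fields. The sample documents and the find helper are hypothetical; this is not the MongoDB or CouchDB query API.

```python
import json

# Each document may carry a different set of attributes.
documents = {
    "u1": {"name": "Ana", "email": "ana@example.com"},
    "u2": {"name": "Ben", "city": "Tampa", "tags": ["admin"]},
    "u3": {"name": "Cy", "city": "Tampa"},
}


def find(docs, **criteria):
    """Return the keys of documents whose fields match every criterion."""
    return sorted(
        key for key, doc in docs.items()
        if all(doc.get(field) == value for field, value in criteria.items())
    )


# Documents serialize naturally to JSON text for storage or transfer.
serialized = json.dumps(documents["u2"])
```

Unlike a pure key-value store, a query such as find(documents, city="Tampa") looks inside the values, and documents without a city attribute simply do not match.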

Column-Family Stores
• Key-Value pair
– Value – wide column
• Multiple column and value pair
• Super column – a collection of columns
• Schema-less nature so that each of their "row"s
can contain a different number of columns
• Column Family – analogous to a table
• Super Column Family / Super Column – a column family within a column family
• Example –
– Google BigTable
• https://cloud.google.com/bigtable/docs/
– Cassandra
• http://www.datastax.com/
• http://cassandra.apache.org/
– DynamoDB (Amazon)
• http://aws.amazon.com/dynamodb/getting-started/
– HBase
• http://hbase.apache.org/
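One way to picture a column family is as a map from row key to a map of columns. The sketch below is a simplification under that assumption; the ColumnFamily class is hypothetical and ignores timestamps, super columns, and on-disk layout.

```python
from collections import defaultdict


class ColumnFamily:
    """Toy column family: row key -> {column name: value}."""

    def __init__(self, name):
        self.name = name
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        # No schema: any row may gain any column at any time.
        self.rows[row_key][column] = value

    def get_row(self, row_key):
        # Return a copy so callers cannot mutate the stored row.
        return dict(self.rows.get(row_key, {}))


users = ColumnFamily("users")
users.put("row1", "name", "Ana")
users.put("row1", "email", "ana@example.com")
users.put("row2", "name", "Ben")
users.put("row2", "city", "Tampa")
users.put("row2", "phone", "555-0100")
```

Here row1 holds two columns and row2 holds three, which is the schema-less behavior described above: each row carries only the columns it actually has.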

Columnar Database
• Partitioned based on columns
• Example – Kudu
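A small sketch of what partitioning by column means: the same records are stored as one list per column instead of one record per row. The sample data and the to_columns helper are made up for illustration and say nothing about Kudu's actual storage format.

```python
# Row-oriented layout: one record per row.
rows = [
    {"id": 1, "product": "pen", "price": 2},
    {"id": 2, "product": "book", "price": 12},
    {"id": 3, "product": "lamp", "price": 30},
]


def to_columns(rows):
    """Pivot row records into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# An aggregate over one attribute reads only that column's values.
total_price = sum(columns["price"])
```

Analytic queries that scan a few attributes over many rows benefit from this layout, since the other columns never need to be read.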

No-SQL Database – ACID vs. BASE
[Diagram: database types placed along a data-type axis running Structured, Semi-Structured, Un-Structured. The types shown are Relational Database, Column-Oriented No-SQL Database, Key-Value No-SQL Database, and Document No-SQL Database.]

HDFS – HADOOP DISTRIBUTED FILE SYSTEM

[Diagram: three storage setups]
• Single node computing with single large disk
• Single node computing with multiple disks in RAID
• Multiple node computing with multiple disks in distributed file system

HDFS
[Diagram: HDFS as a single layer spanning five nodes, each node running Linux (OS)]

Map-Reduce Workflow
[Diagram: a chain of Map-Reduce jobs: HDFS -> Map -> Reduce -> HDFS -> Map -> Reduce -> HDFS -> Map -> Reduce -> HDFS -> Map -> Reduce -> HDFS]
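A single Map-Reduce job can be sketched in plain Python as a word count. This is not the Hadoop API: the function names are hypothetical, the input lines stand in for data read from HDFS, and the shuffle step (grouping by key between Map and Reduce) is made explicit here as an assumption about what happens between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as happens between Map and Reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def map_reduce(lines):
    """Run one job over the input lines and return {word: count}."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


counts = map_reduce(["big data", "Big systems big"])
```

In a workflow of several jobs, the output of one job would be written to the file system and become the input of the next job.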

Spark
[Diagram: Spark running across nodes, reading from and writing to HDFS, with intermediate data held as RDDs in memory]
[Diagram: RDD Variables in Spark, spread across the memory of five nodes]

Machine Learning on Big Data
• SparkML
• Mahout
• H2O
• SparkFlows
• TensorFlow

Search System
• Lucene => Solr => ElasticSearch

Big Data Systems – in a nutshell
• Storage
– Database – NoSQL databases
• HBase, Cassandra, MongoDB, Kudu
– File system – Distributed file system
• HDFS, S3, GFS
• Query
– Hive - offline
– Impala - online
• Computation
– Map-Reduce
• Hadoop Map-Reduce, MongoDB Map-Reduce
– Spark
• PySpark on Jupyter
• Machine learning
– PySpark
– H2O
– TensorFlow
