Data Management for Analytics
Kaushik Dutta
Information Systems and Decision Science
Muma College of Business
University of South Florida
Big Data
http://hortonworks.com/wp-content/uploads/2012/05/bigdata_diagram.png
Data Driven Applications/Analytics
• Excel
• Database
[Figure: diagram with boxes labeled Database, Applications / Analytics, and Analytics]
Traditional Database
• Relational Database – MySQL, Oracle.
• Issues with Relational database
– Weak clustering technology
– Does not scale horizontally
• Adding one more node to a single-instance MySQL database server does not
double the performance.
– Strict data format – suitable for structured data only
• Why?
– Strict ACID rules in Relational Database
• Atomicity, Consistency, Isolation and Durability
• Due to ACID rules, every write must be synchronized across all nodes in the cluster
before a transaction is completed
– Adds overhead to the database system, making linear scaling difficult to achieve
Data Types
• Structured
• Semi-structured
• Unstructured -> Structured
Big Data Storage
• No-SQL Database
• Distributed File Systems
No-SQL Database
• Relaxed ACID property
• Distributed across multiple nodes
• Scaling is more important than perfect synchronization
• Semi-strict data format – suitable for unstructured data
• ACID vs. BASE
– ACID: Atomicity, Consistency, Isolation and Durability
– BASE: Basically available, soft state, eventually consistent
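The "eventually consistent" part of BASE can be illustrated with a small sketch. It is a toy model, not how any particular NoSQL product works: the class names, the `sync` method, and the choice to acknowledge a write after updating a single replica are all assumptions made for illustration.

```python
class Replica:
    """One node holding a copy of the data (hypothetical model)."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy store: writes are acknowledged before all replicas agree."""

    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = []  # writes not yet propagated to the other replicas

    def write(self, key, value):
        # Update only the first replica, then return immediately.
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        # A read may hit any replica, including one that is behind.
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        # Background propagation: apply pending writes to the other replicas.
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica.data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("cart:42", ["book"])
stale = store.read("cart:42", replica_index=2)  # replica 2 has not seen the write yet
store.sync()
fresh = store.read("cart:42", replica_index=2)  # after propagation the replicas agree
```

An ACID system would instead block `write` until every replica had applied the change, which is the synchronization overhead described on the Traditional Database slide.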
CAP Theorem
No-SQL Database
• Key Value stores
• Document Databases
• Wide-Column (or column family) stores
• Columnar Database
Key-Value Stores
• Distributed hash-table
– Key – search based on key, alpha-numeric
– Value – text, lists, set or complex objects
– Example
• Redis (http://redis.io/)
• Voldemort (LinkedIn)
• Berkeley DB
• Riak
• DynamoDB from Amazon
– Usage
• User profiles
• Session data
• Product information
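A distributed hash table can be sketched in a few lines. The example below assumes a fixed number of in-process "nodes" (plain dictionaries) and a simple hash-modulo placement rule; real systems such as Redis or Riak use their own partitioning schemes, so treat the class and method names as hypothetical.

```python
import hashlib


class DistributedKVStore:
    """Toy key-value store that spreads keys over several nodes by hash."""

    def __init__(self, n_nodes=4):
        # Each dictionary stands in for one node of the cluster.
        self.nodes = [{} for _ in range(n_nodes)]

    def _node_for(self, key):
        # Hash the key and map the digest onto one of the nodes.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.nodes)

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key, default=None):
        # Lookup is by key only; values are opaque to the store.
        return self.nodes[self._node_for(key)].get(key, default)


kv = DistributedKVStore()
kv.put("session:abc123", {"user": "alice", "cart": ["sku-1", "sku-2"]})
kv.put("profile:alice", "Alice Example")
session = kv.get("session:abc123")
missing = kv.get("session:unknown")
```

Because the store never inspects the value, it can hold text, lists, sets or complex objects equally well, which is why session data and user profiles fit this model.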
Document Database
• Both keys and values are searchable
• Value – semi-structured data – (name, value) pairs
• Value columns may vary from row to row
– Different rows may have a different number and type of attributes
• Typical value – JSON, XML, BSON (Binary JSON)
• Example
– CouchDB (JSON)
• http://couchdb.apache.org/
– MongoDB (BSON)
• https://www.mongodb.org/
• Usage – storing and managing text documents,
email messages, XML documents
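The difference from a key-value store is that the value itself can be queried. The sketch below models a collection as a Python dictionary of JSON-like documents with differing attributes; the `find` helper and the sample data are made up for illustration and are not the CouchDB or MongoDB API.

```python
import json

# Each document is a set of (name, value) pairs; attributes differ per document.
documents = {
    "msg-1": {"type": "email", "from": "ann@example.com", "subject": "Hello"},
    "msg-2": {"type": "email", "from": "bob@example.com", "subject": "Report",
              "attachments": 2},
    "doc-1": {"type": "note", "text": "Buy milk"},
}


def find(collection, **criteria):
    """Return the sorted keys of documents whose fields match all criteria."""
    return sorted(
        key
        for key, doc in collection.items()
        if all(doc.get(field) == wanted for field, wanted in criteria.items())
    )


emails = find(documents, type="email")              # search on a value field
with_attachments = find(documents, attachments=2)   # field present in only one document
serialized = json.dumps(documents["msg-2"])         # documents serialize naturally to JSON
```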
Column-Family Stores
• Key-Value pair
– Value – wide column
• Multiple (column, value) pairs
• Super column – a collection of columns
• Schema-less, so each "row"
can contain a different number of columns
• Column Family – analogous to a table
• Super Column Family / Super Column – a column
family within a column family
• Example –
– Google BigTable
• https://cloud.google.com/bigtable/docs/
– Cassandra
• http://www.datastax.com/
• http://cassandra.apache.org/
– DynamoDB (Amazon)
• http://aws.amazon.com/dynamodb/getting-started/
– HBase
• http://hbase.apache.org/
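The wide-column idea, a row key pointing at a variable set of (column, value) pairs, can be sketched with nested dictionaries. This is a simplified model under the assumption that a column family behaves like a table of sparse rows; the `ColumnFamily` class is hypothetical and does not mirror the Cassandra or HBase client APIs.

```python
from collections import defaultdict


class ColumnFamily:
    """Toy column family: row key -> {column name: value}."""

    def __init__(self, name):
        self.name = name
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        # Columns are created on write; there is no fixed schema.
        self.rows[row_key][column] = value

    def get_row(self, row_key):
        # Return a copy of the row, or an empty dict for an unknown key.
        return dict(self.rows.get(row_key, {}))


users = ColumnFamily("users")
users.put("u1", "name", "Alice")
users.put("u1", "email", "alice@example.com")
users.put("u2", "name", "Bob")
users.put("u2", "phone", "555-0100")
users.put("u2", "city", "Tampa")

# Each row carries its own set of columns.
columns_per_row = {key: sorted(cols) for key, cols in users.rows.items()}
```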
Columnar Database
• Partitioned based on columns
• Example – Kudu
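To make "partitioned based on columns" concrete, the sketch below converts a small row-oriented table into a column-oriented layout. The sample data is invented, and an in-memory dictionary of lists is only a stand-in for how a columnar engine such as Kudu lays data out on disk.

```python
# Row-oriented layout: one record per entry, all fields stored together.
rows = [
    {"id": 1, "city": "Tampa", "sales": 120.0},
    {"id": 2, "city": "Miami", "sales": 80.5},
    {"id": 3, "city": "Tampa", "sales": 42.0},
]

# Column-oriented layout: one list per column, values stored contiguously.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate over one column touches only that column's values.
total_sales = sum(columns["sales"])
```

Analytical queries that scan a few columns over many rows benefit from this layout because the unused columns never need to be read.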
No-SQL Database – ACID vs. BASE
[Figure: axis labeled Structured, Semi-Structured, Un-Structured, with boxes labeled Column-Oriented No-SQL Database, Relational Database, Key-Value No-SQL Database, and Document No-SQL Database]
HDFS – HADOOP DISTRIBUTED FILE SYSTEM
Distributed file system
[Figure: three setups compared: (1) single node computing with a single large disk; (2) single node computing with multiple disks in RAID; (3) multiple node computing with multiple disks in a distributed file system]
HDFS
[Figure: HDFS spanning five nodes, each running Linux (OS)]
Map Reduce
Map-Reduce Workflow
[Figure: chained jobs: HDFS -> Map -> Reduce -> HDFS -> Map -> Reduce -> HDFS -> Map -> Reduce -> HDFS -> Map -> Reduce -> HDFS]
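A single Map-Reduce stage of this workflow can be sketched in plain Python as a word count. This runs in one process and uses in-memory lists where Hadoop would read from and write to HDFS; the function names and the sample input are assumptions for illustration, not the Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


input_lines = ["big data big ideas", "data moves to compute"]
counts = run_job(input_lines)
```

In a chained workflow, the output of one job (here `counts`) would be written back to the file system and become the input of the next Map step.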
Spark
[Figure: chained Spark stages between HDFS input and HDFS output, with intermediate results kept as RDDs in memory]
RDD Variables in Spark
[Figure: five nodes, each with its own memory, holding an RDD]
Machine Learning on Big Data
• SparkML
• Mahout
• H2O
• SparkFlows
• TensorFlow
Search System
• Lucene => Solr => Elasticsearch
Big Data Systems – in a nutshell
• Storage
– Database – NoSQL databases
• HBase, Cassandra, MongoDB, Kudu
– File system – Distributed file system
• HDFS, S3, GFS
• Query
– Hive - offline
– Impala - online
• Computation
– Map-Reduce
• Hadoop Map-Reduce, MongoDB Map-Reduce
– Spark
• PySpark on Jupyter
• Machine learning
– PySpark
– H2O
– TensorFlow
THANK YOU

Big data Intro by Kaushik Dutta
