Data Management for Analytics
Kaushik Dutta
Information Systems and Decision Science
Muma College of Business
University of South Florida
Big Data
http://hortonworks.com/wp-content/uploads/2012/05/bigdata_diagram.png
Data-Driven Applications / Analytics
• Excel
• Database
(Diagram labels: Database; Applications / Analytics; Analytics)
Traditional Database
• Relational databases – MySQL, Oracle
• Issues with relational databases
– Weak clustering technology
– Do not scale horizontally
• Adding one more node to a single-instance MySQL database server does not double the performance.
– Strict data format – suitable for structured data only
• Why?
– Strict ACID rules in relational databases
• Atomicity, Consistency, Isolation and Durability
• Under ACID rules, data must be synchronized across all nodes of the cluster before a transaction completes
– This adds overhead to the database system, making linear scaling very difficult to achieve
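The atomicity part of the ACID rules above can be illustrated on a single node with a minimal sketch. It uses Python's built-in sqlite3 module rather than MySQL or Oracle; the accounts table and the transfer helper are hypothetical names chosen for illustration only.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst inside one transaction (hypothetical helper)."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False

def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))

# Illustrative data: an in-memory database with two accounts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()
```

A failed transfer leaves both balances untouched because the partial debit is rolled back. In a clustered relational database the same guarantee has to hold across nodes, which is where the synchronization overhead comes from.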
Data Types
• Structured
• Semi-structured
• Unstructured -> Structured
Big Data Storage
• No-SQL Database
• Distributed File Systems
No-SQL Database
• Relaxed ACID property
• Distributed across multiple nodes
• Scaling is more important than perfect synchronization
• Semi-strict data format – suitable for unstructured data
• ACID vs. BASE
– ACID: Atomicity, Consistency, Isolation and Durability
– BASE: Basically Available, Soft state, Eventually consistent
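"Eventually consistent" can be pictured with a toy store in which a write is acknowledged by one replica and reaches the others only later. This is a sketch under simplifying assumptions (no failures, no conflicting writes); the class and method names are hypothetical and do not correspond to any real database API.

```python
class Replica:
    """One copy of the data."""
    def __init__(self):
        self.data = {}

class EventuallyConsistentStore:
    """Toy store: a write lands on replica 0 and is propagated to the rest later."""
    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = []  # (replica_index, key, value) updates not yet applied

    def write(self, key, value):
        self.replicas[0].data[key] = value        # acknowledged immediately
        for i in range(1, len(self.replicas)):
            self.pending.append((i, key, value))  # delivered asynchronously

    def read(self, key, replica=0):
        return self.replicas[replica].data.get(key)

    def sync(self):
        """Deliver all pending updates; afterwards every replica agrees."""
        for i, key, value in self.pending:
            self.replicas[i].data[key] = value
        self.pending.clear()
```

Between write and sync, a read from another replica may return stale data. That window is the price paid for not blocking the write on every node.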
CAP Theorem
No-SQL Database
• Key Value stores
• Document Databases
• Wide-Column (or column family) stores
• Columnar Database
Key-Value Stores
• Distributed hash-table
– Key – search based on key, alpha-numeric
– Value – text, lists, sets or complex objects
– Examples
• Redis (http://redis.io/)
• Voldemort (LinkedIn)
• Berkeley DB
• Riak
• DynamoDB from Amazon
– Usage
• User profiles
• Session data
• Product information
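The distributed hash-table idea can be sketched as follows: the key is hashed, and the hash decides which node holds the value. This is a minimal illustration with in-process dictionaries standing in for nodes. ToyKeyValueStore is a hypothetical name, and real stores such as those listed above use more elaborate schemes (for example consistent hashing and replication).

```python
import hashlib

class ToyKeyValueStore:
    """Keys are hashed to pick one of several 'nodes' (plain dicts here)."""
    def __init__(self, n_nodes=4):
        self.nodes = [{} for _ in range(n_nodes)]

    def _node_for(self, key):
        # A stable hash, so the same key always maps to the same node.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.nodes)

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key, default=None):
        return self.nodes[self._node_for(key)].get(key, default)
```

Lookups are only possible by key. The value (for example a session record) is opaque to the store, which matches the usage cases listed above.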
Document Database
• Both keys and values are searchable
• Value – semi-structured data – (name, value) pairs
• Value columns may vary from row to row
– Different rows may have a different number and type of attributes
• Typical value formats – JSON, XML, BSON (Binary JSON)
• Examples
– CouchDB (JSON)
• http://couchdb.apache.org/
– MongoDB (BSON)
• https://www.mongodb.org/
• Usage – storing and managing text documents, email messages, XML documents
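A small sketch of the document model: each value is a JSON-like document, documents need not share the same attributes, and queries can match on fields inside the value rather than only on the key. The sample documents and the find helper are hypothetical; they are not the query API of CouchDB or MongoDB.

```python
import json

# Three documents with different sets of attributes (illustrative data).
documents = {
    "u1": {"name": "Ada", "email": "ada@example.com"},
    "u2": {"name": "Lin", "tags": ["admin", "ops"], "age": 41},
    "u3": {"name": "Ada", "age": 29},
}

def find(collection, **criteria):
    """Return the keys of documents whose fields match all given criteria."""
    return sorted(
        key
        for key, doc in collection.items()
        if all(doc.get(field) == value for field, value in criteria.items())
    )

# Documents serialize naturally to JSON text.
serialized = json.dumps(documents)
```

Here find(documents, name="Ada") matches on a field of the value, which a plain key-value store would not support.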
Column-Family Stores
• Key-value pairs
– Value – wide column
• Multiple (column, value) pairs
• Super column – a collection of columns
• Schema-less, so each "row" can contain a different number of columns
• Column family – analogous to a table
• Super column family / super column – a column family within a column family
• Examples
– Google BigTable
• https://cloud.google.com/bigtable/docs/
– Cassandra
• http://www.datastax.com/
• http://cassandra.apache.org/
– DynamoDB (Amazon)
• http://aws.amazon.com/dynamodb/getting-started/
– HBase
• http://hbase.apache.org/
• http://hbase.apache.org/
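One way to picture a column-family store is as a nested map: row key, then column family, then column name, then value. The sketch below makes that concrete. ToyColumnFamilyTable is a hypothetical illustration and not the data model or API of any of the systems listed above.

```python
from collections import defaultdict

class ToyColumnFamilyTable:
    """row key -> column family -> column name -> value."""
    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        self.rows[row_key][family][column] = value

    def get_row(self, row_key, family):
        """Return a copy of all columns stored for one row in one family."""
        return dict(self.rows[row_key][family])
```

Because columns are created per row at write time, two rows in the same column family can hold a different number of columns, which is the schema-less property described above.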
Columnar Database
• Partitioned based on columns
• Example – Kudu
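Column-based partitioning can be sketched by pivoting row records into one list per column, so that an aggregate reads only the column it needs. The sample rows and the to_columns helper are hypothetical, and the sketch assumes every row has the same fields.

```python
# Row-oriented layout: one dict per record (illustrative data).
rows = [
    {"id": 1, "city": "Tampa", "sales": 120},
    {"id": 2, "city": "Miami", "sales": 80},
    {"id": 3, "city": "Tampa", "sales": 50},
]

def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)
total_sales = sum(columns["sales"])  # touches a single column only
```

For analytics queries that scan a few columns over many rows, this layout avoids reading the attributes that the query does not use.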
No-SQL Database – ACID vs. BASE
(Diagram: relational database and column-oriented, key-value and document No-SQL databases positioned along a Structured / Semi-Structured / Un-Structured axis)
HDFS – HADOOP DISTRIBUTED FILE SYSTEM
Distributed file system
(Diagram: single-node computing with a single large disk; single-node computing with multiple disks in RAID; multiple-node computing with multiple disks in a distributed file system)
HDFS
(Diagram: HDFS spanning five nodes, each running Linux (OS))
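The slides do not spell out how HDFS spreads a file over the nodes. As background, HDFS cuts a file into fixed-size blocks and stores several replicas of each block on different nodes. The sketch below illustrates that idea with a simple round-robin placement. The function names are hypothetical, and the real HDFS placement policy is rack-aware rather than round-robin.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks a file of `file_size` bytes is cut into."""
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])

def place_replicas(n_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin (an assumption)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block in range(n_blocks):
        placement[block] = [
            nodes[(block + r) % len(nodes)] for r in range(replication)
        ]
    return placement
```

With five nodes and three replicas per block, any single node can fail without losing a block, which is what makes a cluster of commodity machines usable as one large disk.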
Map Reduce
Map-Reduce Workflow
(Diagram: a chain of four Map → Reduce stages, each reading its input from HDFS and writing its output back to HDFS)
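A single Map-Reduce stage can be sketched in plain Python using the classic word-count example: the map phase emits (key, value) pairs, a shuffle groups them by key, and the reduce phase aggregates each group. The function names are hypothetical and this runs in one process; Hadoop distributes the same phases across nodes and reads and writes HDFS between stages.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Aggregate the values collected for one key."""
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Each map call and each reduce call depends only on its own input, which is why the phases can run in parallel on different nodes.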
Spark
(Diagram: HDFS → four chained Spark stages → HDFS, with RDDs held in memory between stages)
RDD Variables in Spark
(Diagram: five nodes, each labeled Node / Memory)
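The idea of keeping intermediate results in memory instead of re-reading storage can be illustrated with a toy, single-process stand-in for an RDD. ToyRDD and load_from_storage are hypothetical names. This is not Spark's API, only a sketch of lazy transformations plus caching.

```python
class ToyRDD:
    """Lazy, cacheable dataset; a stand-in for the RDD idea, not Spark's API."""
    def __init__(self, compute):
        self._compute = compute   # zero-argument function returning a list
        self._cached = None
        self._use_cache = False

    def map(self, fn):
        return ToyRDD(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        """Keep the result in memory after the first computation."""
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = self._compute()
            return self._cached
        return self._compute()

reads = {"count": 0}

def load_from_storage():
    reads["count"] += 1   # stands in for one read from HDFS
    return [1, 2, 3, 4, 5, 6]

source = ToyRDD(load_from_storage)
squares = source.map(lambda x: x * x).cache()   # nothing is computed yet
evens = squares.filter(lambda x: x % 2 == 0)
total = sum(squares.collect())                  # first and only storage read
```

Later computations that build on squares, such as evens.collect(), reuse the in-memory result rather than going back to storage, in contrast to a Map-Reduce chain that writes to and reads from HDFS between stages.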
Machine Learning on Big Data
• SparkML
• Mahout
• H2O
• SparkFlows
• TensorFlow
Search System
• Lucene => Solr => Elasticsearch
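Search systems in the Lucene family are built around an inverted index, which maps each term to the documents that contain it. The sketch below shows the idea with whitespace tokenization and AND-only queries. The function names are hypothetical and nothing here reflects the actual Lucene, Solr or Elasticsearch APIs.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lower-cased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return sorted ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)
```

A query then costs a few set lookups and an intersection instead of a scan over every document.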
Big Data Systems – in a nutshell
• Storage
– Database – NoSQL databases
• HBase, Cassandra, MongoDB, Kudu
– File system – Distributed file system
• HDFS, S3, GFS
• Query
– Hive - offline
– Impala - online
• Computation
– Map-Reduce
• Hadoop Map-Reduce, MongoDB Map-Reduce
– Spark
• PySpark on Jupyter
• Machine learning
– PySpark
– H2O
– TensorFlow
THANK YOU
