Big Data trends and the rising importance of NOSQL. Abhijit Sharma, Architect, Innovation & Incubation Lab, BMC Software
Trends in cloud, web, and even enterprise scale apps. Unprecedented growth in data set sizes that need to be stored and analyzed (Big Data): cloud-scale services such as FB, eBay, Digg, and Foursquare generate terabytes to petabytes. Connectedness and democratization of data: social networks, feeds, blogs, wikis, tags, semantic web. Data APIs: mash up data using the Twitter, FB, and Flickr APIs. Semi-structured or unstructured data. Performance requirements of these apps: humongous read/write scalability, high availability, trading consistency for availability (ACID not mandatory).
RDBMS woes. Challenge: storing and scaling humongous amounts of data while remaining highly available. Vertical scaling mostly: it has an upper limit and is expensive. Horizontal scaling: no automatic sharding, no rebalancing, no infrastructure. Distributed transactions and joins (a consequence of normalization) inhibit performance and availability. Schema-less data models: the rigid schema gets in the way (ALTER TABLE, null columns). Deeply connected data: not designed for this.
The NOSQL Alternative: NOSQL is NOT No SQL
The NOSQL Alternative: NOSQL is simply Not only SQL
NOSQL: so what else is it? The "one size fits all" RDBMS is not working. NOSQL alternatives are polyglot solutions that better fit the new requirements thrown up by these trends. They can be categorized along these axes: data model (simple to complex), scalability (single node to horizontal), persistence.
NOSQL categories. Graph databases: based on graph theory; data model: graph of nodes, edges, and properties; scalability: single node, high performance; persistence: on-disk data structures; examples: Neo4J, AllegroGraph. Document databases: based loosely on documents/Lotus Notes; data model: collections of documents; scalability: horizontal, auto-sharding and replication; persistence: B-Tree; examples: mongoDB, CouchDB.
NOSQL categories. Column stores: based on Google's BigTable design; data model: big table, column families; scalability: horizontal, auto-sharding and replication; persistence: memory + file (on DFS); examples: HBase, Cassandra. Key value stores: based on DHT and Amazon's Dynamo design; data model: collection of key value pairs; scalability: horizontal, auto-sharding and replication; persistence: memory or file; examples: Redis, Amazon Dynamo, Voldemort.
Graph Databases
Graph oriented data. Graphs are ubiquitous: social networks, wikis, the web, recommendation engines, and more. Deep trees, complex networks. Graph traversal is apt for expressing graph related problems (shortest path, network size, etc.).
LinkedIn Social Graph
Why not RDBMS for large scale graphs? It is difficult to model and traverse graphs in an RDBMS: recursive approaches lead to slow SQL queries that span many table joins, and to hacks like storing paths for trees.
Graph Databases. Designed for efficient storage and traversal of large scale graphs. Natural modeling of a graph network: nodes, relationships, and their properties. Neo4J is a leading graph db: supports billions of nodes/edges; traverses depths of 1000 levels in milliseconds, roughly 1000x faster than an RDBMS; handles large graphs that don't fit in memory through a persistent transactional store optimized for graphs; REST API and various language bindings; graph pattern matching, the Cypher query language, and Lucene-based indexing.
Graph basics
All Paths & My Network size
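The "network size" question can be sketched as a breadth-first expansion that counts how many new people become reachable at each hop. This is a minimal illustration in plain Python, not Neo4J code; the `FRIENDS` graph and the function name are hypothetical.

```python
# Hypothetical toy friendship graph: undirected adjacency lists.
FRIENDS = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol", "erin"],
    "erin": ["dave"],
}


def network_size_by_depth(graph, start, max_depth):
    """Return how many new people are reached at hop 1, 2, ..., max_depth."""
    seen = {start}
    frontier = [start]
    sizes = []
    for _ in range(max_depth):
        next_frontier = []
        for person in frontier:
            for friend in graph.get(person, []):
                if friend not in seen:
                    seen.add(friend)
                    next_frontier.append(friend)
        sizes.append(len(next_frontier))
        frontier = next_frontier
    return sizes


print(network_size_by_depth(FRIENDS, "alice", 3))  # prints [2, 1, 1]
```

Each hop only touches the neighbors of the current frontier, which is the access pattern a graph store is built to serve without joins.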
Shortest path between …
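A shortest path in an unweighted graph can be found with breadth-first search. The sketch below assumes a small in-memory adjacency list (the `FRIENDS` data is made up) and is not the API of any particular graph database.

```python
from collections import deque

# Hypothetical toy friendship graph: undirected adjacency lists.
FRIENDS = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol", "erin"],
    "erin": ["dave"],
    "zed": [],
}


def shortest_path(graph, source, target):
    """Return one shortest path as a list of nodes, or None if unreachable."""
    if source == target:
        return [source]
    parents = {source: None}  # node -> node we discovered it from
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor in parents:
                continue
            parents[neighbor] = node
            if neighbor == target:
                # Walk the parent links back to the source.
                path = [target]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


print(shortest_path(FRIENDS, "alice", "erin"))
```

The same search also answers the "is connected to?" question: a result of `None` means no path exists.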
Is connected to?
You may know…
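One common way to produce "you may know" suggestions is to rank friends-of-friends by the number of mutual friends. The following is a sketch under that assumption; the graph data and function name are hypothetical.

```python
from collections import Counter

# Hypothetical toy friendship graph: undirected adjacency lists.
FRIENDS = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol", "erin"],
    "erin": ["dave"],
}


def people_you_may_know(graph, person):
    """Rank non-friends two hops away by how many friends they share with person."""
    friends = set(graph.get(person, []))
    mutual = Counter()
    for friend in friends:
        for candidate in graph.get(friend, []):
            if candidate != person and candidate not in friends:
                mutual[candidate] += 1
    # Most mutual friends first; ties broken alphabetically for stable output.
    return sorted(mutual.items(), key=lambda item: (-item[1], item[0]))


print(people_you_may_know(FRIENDS, "alice"))  # prints [('dave', 2)]
```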
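Mining your network. Centrality algorithms. Degree: who has the most followers on Twitter. Closeness: who can reach everyone else in the fewest hops. Betweenness: who sits on the most shortest paths between other people. Eigenvector (PageRank): who has more influential people following them.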
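As a rough illustration, two of the simpler centrality measures can be computed directly from an adjacency list, using their standard graph-theory definitions: degree centrality counts direct connections, and closeness centrality is the number of reachable nodes divided by the sum of shortest-path distances to them. The graph below is made up.

```python
from collections import deque

# Hypothetical toy graph: undirected adjacency lists.
GRAPH = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol", "erin"],
    "erin": ["dave"],
}


def degree_centrality(graph):
    """Number of direct connections per node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def closeness_centrality(graph, node):
    """(reachable nodes) / (sum of BFS distances to them); 0.0 if isolated."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for neighbor in graph.get(current, []):
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


degrees = degree_centrality(GRAPH)
closeness = {node: closeness_centrality(GRAPH, node) for node in GRAPH}
print(max(degrees, key=degrees.get), max(closeness, key=closeness.get))
```

In this toy graph "dave" comes out on top for both measures; betweenness and eigenvector centrality need more machinery and are left out of the sketch.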
Document Databases
Flexible document oriented data. Document-style unstructured data: schema-less, e.g. JSON documents. No ALTER TABLE needed as in an RDBMS; data is de-normalized. Useful for iterative/agile development. Humongous scale: billions of documents, read/write traffic of millions/sec, horizontal scalability, availability. mongoDB is a leading document database.
Document Database: use cases. Archiving historic data that has undergone many schema changes. A flexible set of performance metrics (web site page views, unique visitors, etc.) that changes over time, with no need to update existing JSON documents. Tracking near real time metrics with optimized increments of perf counters. Geolocation-based mobile and gaming apps (geospatial indices can be key here).
Craigslist Archival Database. A premium service let customers search over their historical postings. Archival (no purging) of 10 years of postings: billions of documents. Schema changes across versions. With the MySQL-based archival database, ALTER TABLE took a month to complete.
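The metrics use case can be sketched with plain Python dictionaries standing in for JSON documents. `MetricsCollection` is a hypothetical in-memory stand-in, not a MongoDB driver API; it only illustrates upsert-style counter increments and how a new metric can appear without touching older documents.

```python
class MetricsCollection:
    """Toy in-memory document collection keyed by (page, day). Illustrative only."""

    def __init__(self):
        self.docs = {}

    def increment(self, page, day, metric, amount=1):
        # Upsert: create the document on first use, then bump the counter in place.
        doc = self.docs.setdefault((page, day), {"page": page, "day": day})
        doc[metric] = doc.get(metric, 0) + amount
        return doc


metrics = MetricsCollection()
metrics.increment("/home", "2011-10-01", "page_views")
metrics.increment("/home", "2011-10-01", "page_views")
# A metric introduced later simply shows up as a new field on newer documents.
metrics.increment("/home", "2011-10-02", "unique_visitors")

print(metrics.docs[("/home", "2011-10-01")])
print(metrics.docs[("/home", "2011-10-02")])
```

The two documents end up with different fields, and no schema migration was needed for the older one.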
Foursquare
Geo: optimized for geolocation queries, e.g. find a Starbucks near my current GPS location
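A "near me" query boils down to ranking venues by great-circle distance from a point. The sketch below uses the haversine formula with a linear scan over made-up venues; a real geospatial index exists precisely to avoid scanning every document, so treat this only as an illustration of the query's meaning.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest(venues, lat, lon, limit=1):
    """Return the `limit` venues closest to (lat, lon), nearest first."""
    return sorted(venues, key=lambda v: haversine_km(lat, lon, v["lat"], v["lon"]))[:limit]


# Hypothetical venue documents with made-up coordinates.
VENUES = [
    {"name": "cafe-a", "lat": 40.7580, "lon": -73.9855},
    {"name": "cafe-b", "lat": 40.7484, "lon": -73.9857},
    {"name": "cafe-c", "lat": 40.6892, "lon": -74.0445},
]

closest = nearest(VENUES, 40.7500, -73.9860)
print(closest[0]["name"])  # prints cafe-b
```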
mongoDB Features. JSON documents, collection-oriented storage. Rich, document-based queries. Indexes on document attributes. Fast in-place updates. Scalability features: horizontal scalability, configurable replication and high availability, auto-sharding and rebalancing. Language-specific drivers: Java, Scala, Ruby, etc.
Column Stores
Column Store. Reasonably rich data model: a sparse, distributed, persistent, multi-dimensional sorted map with sorted row keys and columns. Use cases: large scale data storage and analysis. Time series data along with associated dimension data: row keys are timestamps and thus sorted, which helps time range queries. Google Analytics: provides aggregate statistics such as unique visitors/day and page views/URL/day; the raw click table (~200 TB) has a row for each URL + user session time, which keeps sessions for the same URL contiguous and chronologically sorted. Data cube: CPU, OS, time, DC.
Column Store Performance. Excellent read/write performance with large storage (petabytes). High scalability: horizontal scaling, auto-sharding. High availability: transparent replication of data. HBase is a leading column store, built on Hadoop with HDFS as the underlying persistence.
Column Store: HBase. A table defines column families, which group similar attributes (vertical partitioning). A (Table, Row, ColumnFamily:Column, Timestamp) tuple maps to a cell value. A table is split into multiple equal, distributed regions, each of which is a range of sorted keys (partitioned automatically by key). Rows are ordered by key; columns are ordered within a column family. Rows can have different numbers of columns. Columns have a value and versions (any number). Row range, column range, and key range queries.
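Why sorted row keys help time range queries can be shown with a small in-memory sketch: if keys are kept sorted, a range scan is two binary searches plus a contiguous slice. `SortedTable` and its methods are hypothetical names, not an HBase or BigTable API.

```python
import bisect


class SortedTable:
    """Toy table whose row keys are kept in sorted order. Illustrative only."""

    def __init__(self):
        self._keys = []  # row keys, always sorted
        self._rows = {}  # row key -> {column: value}

    def put(self, row_key, column, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def scan(self, start_key, stop_key):
        """Yield (key, row) for start_key <= key < stop_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        for key in self._keys[lo:hi]:
            yield key, self._rows[key]


table = SortedTable()
# ISO-style timestamps sort lexicographically in chronological order.
table.put("2011-10-01T10:05", "cpu", 42)
table.put("2011-10-01T09:55", "cpu", 17)
table.put("2011-10-01T10:30", "cpu", 63)
table.put("2011-10-01T11:10", "cpu", 80)

ten_oclock_hour = list(table.scan("2011-10-01T10:00", "2011-10-01T11:00"))
print([key for key, _ in ten_oclock_hour])
```

The scan returns only the two rows in the 10:00 hour, without examining the rest of the table.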
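The cell addressing described on this slide can be pictured as a map from (row, "family:column") to a list of timestamped versions. The class below is a hypothetical in-memory sketch of that idea, not the HBase client API.

```python
class VersionedTable:
    """Toy model of (row, 'family:column', timestamp) -> value. Illustrative only."""

    def __init__(self):
        # (row, column) -> list of (timestamp, value), newest first
        self._cells = {}

    def put(self, row, column, timestamp, value):
        versions = self._cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        for ts, value in self._cells.get((row, column), []):
            if timestamp is None or ts <= timestamp:
                return value
        return None


table = VersionedTable()
table.put("user1", "profile:email", 100, "old@example.com")
table.put("user1", "profile:email", 200, "new@example.com")
# Rows may carry different columns; nothing is stored for absent ones (sparse).
table.put("user2", "profile:phone", 150, "555-0100")

print(table.get("user1", "profile:email"))       # latest version
print(table.get("user1", "profile:email", 150))  # version as of timestamp 150
```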
HBase Architecture
Key Value Stores
Key Value Stores. Simplest possible data model. Use cases: caching a user's personalized, rendered page to avoid hitting the DB; S3 bucket storage for blob data against a unique id. Range of KV stores: distributed, scalable, persistent key-value storage (Dynamo, Voldemort) with an auto-partitioned key space, replicated and highly available; largely in-memory KV stores (Redis, memcached). Redis is blazing fast for caching and other interesting operations.
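An auto-partitioned, replicated key space in the Dynamo style is commonly built on consistent hashing: nodes and keys are hashed onto a ring, and a key belongs to the next nodes clockwise. The sketch below is a simplified illustration; the class name, the virtual-node count, and the use of SHA-256 are assumptions, not details from any specific store.

```python
import bisect
import hashlib


class HashRing:
    """Simplified consistent-hash ring with virtual nodes. Illustrative only."""

    def __init__(self, nodes, vnodes=64):
        ring = []
        for node in nodes:
            for i in range(vnodes):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._hashes = [h for h, _ in ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def preference_list(self, key, replicas=2):
        """First `replicas` distinct nodes clockwise from the key's ring position."""
        start = bisect.bisect(self._hashes, self._hash(key))
        result = []
        for step in range(len(self._ring)):
            node = self._ring[(start + step) % len(self._ring)][1]
            if node not in result:
                result.append(node)
                if len(result) == replicas:
                    break
        return result

    def node_for(self, key):
        """Primary owner of the key."""
        return self.preference_list(key, replicas=1)[0]


ring = HashRing(["node-a", "node-b", "node-c"])
owners = ring.preference_list("user:42", replicas=2)
print(owners)
```

Because only the ring positions near an added or removed node change owners, most keys stay where they are when the cluster is resized.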
Redis. In-memory KV store. Blazing fast: 100 K/sec reads/writes. Async snapshot to disk. More than a KV store, a data structure store: supports lists, queues, sets, and operations on them; sorted set range operations; set operations UNION, INTERSECTION, DIFF.
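The semantics of the set operations can be illustrated with Python's built-in sets, which behave like the server-side union, intersection, and difference. This is not Redis client code, and the follower data is made up.

```python
# Hypothetical data: who follows me, and whom I follow.
followers = {"ann", "bob", "cy"}
following = {"bob", "cy", "dee"}

mutual = followers & following              # INTERSECTION: we follow each other
not_following_back = following - followers  # DIFF: I follow them, they don't follow me
everyone = followers | following            # UNION: anyone connected to me

print(sorted(mutual), sorted(not_following_back), sorted(everyone))
```

Note that the difference is directional: `followers - following` would instead give the people who follow me but whom I do not follow.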
Redis: use cases. Web session caching, with EXPIRE set for session expiry. Live, real-time bit.ly URL stats such as clicks: fast increments of counters. Autocomplete: typing the first few characters maps to a sorted set, and a range query is fired. Publish/subscribe: fan out a message to subscribers. Set operations: my Twitter <Followees DIFF Followers> tells me whom I follow who don't follow me back, while <Followers INTERSECTION Followees> gives the mutual follows.
Thanks. Email: abhijit.sharma@gmail.com; Twitter: sharmaabhijit; Blog: abhijitsharma.blogspot.com
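The autocomplete idea (a prefix mapped to a range query over sorted terms) can be sketched with a sorted Python list and binary search. The term list and function name are hypothetical, and this stands in for the server-side range query rather than reproducing any Redis command.

```python
import bisect


def autocomplete(sorted_terms, prefix, limit=5):
    """Return up to `limit` terms starting with `prefix` from a sorted list."""
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        # Terms sharing a prefix are contiguous in sorted order, so stop at the first miss.
        if not term.startswith(prefix) or len(matches) == limit:
            break
        matches.append(term)
    return matches


TERMS = sorted(["redis", "riak", "mongodb", "mongo", "neo4j"])

print(autocomplete(TERMS, "mo"))  # prints ['mongo', 'mongodb']
print(autocomplete(TERMS, "r"))   # prints ['redis', 'riak']
```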

 

Big Data and the growing relevance of NoSQL

  • 1. Big Data trends and the rising importance of NOSQL. Abhijit Sharma, Architect, Innovation & Incubation Lab, BMC Software
  • 2. Trends in cloud, web, and even enterprise-scale apps. Unprecedented growth in data set sizes that need to be stored and analyzed (Big Data): cloud-scale services such as FB, eBay, Digg and Foursquare generate TBs to PBs. Connectedness and democratization of data: social networks, feeds, blogs, wikis, tags, the semantic web. Data APIs: mash up data using the Twitter, FB and Flickr APIs. Semi-structured or unstructured data. Performance requirements of these apps: humongous R/W scalability; high availability; trading consistency for availability (ACID not mandatory).
  • 3. RDBMS woes. Challenge: storing and scaling humongous amounts of data while remaining highly available. Scaling is mostly vertical, which is expensive and has an upper limit. Horizontal scaling has no supporting infrastructure: no automatic sharding, no rebalancing. Distributed transactions and joins (a result of normalization) inhibit performance and availability. Schema-less data models fit poorly with a rigid schema: ALTER TABLE, null columns. Deeply connected data: RDBMSs are not designed for this.
  • 4. The NOSQL Alternative: NOSQL is NOT No SQL
  • 5. The NOSQL Alternative: NOSQL is simply Not only SQL
  • 6. NOSQL: so what else is it? The “one size fits all” RDBMS is not working. NOSQL alternatives are polyglot solutions that better fit the new requirements thrown up by the trends. They can be categorized along these axes: data model (simple to complex), scalability (single node to horizontal), persistence.
  • 7. NOSQL categories. Graph Databases. Based on graph theory. Data model: graph, nodes, edges, properties. Scalability: single node, high performance. Persistence: on-disk data structures. Examples: Neo4J, AllegroGraph. Document Databases. Based loosely on documents/Lotus Notes. Data model: collections of documents. Scalability: horizontal, auto-sharding & replication. Persistence: B-Tree. Examples: mongoDB, CouchDB.
  • 8. NOSQL categories. Column Stores. Based on Google’s BigTable design. Data model: big table, column families. Scalability: horizontal, auto-sharding & replication. Persistence: memory + file (on DFS). Examples: HBase, Cassandra. Key Value Stores. Based on DHT, Amazon’s Dynamo design. Data model: collection of key value pairs. Scalability: horizontal, auto-sharding & replication. Persistence: memory or file. Examples: Redis, Amazon Dynamo, Voldemort.
  • 10. Graph oriented data. Graphs are ubiquitous: social networks, wikis, the web, recommendation engines and so on. Deep trees, complex networks. Graph traversal is apt for expressing graph-related problems (shortest path, network size, etc.).
  • 12. Why not RDBMS for large scale graphs? Graphs are difficult to model and traverse in an RDBMS. Recursive approaches are slow: SQL queries that span many table joins. Hacks like storing paths for trees.
  • 13. Graph Databases. Designed for efficient storage & traversal of large scale graphs. Natural modeling of a graph network: nodes, relationships and their properties. Neo4J is a leading graph db. Supports billions of nodes/edges; traverses depths of 1000 levels in ms, 1000x faster than an RDBMS. Handles large graphs that don't fit in memory: persistent transactional store optimized for graphs. REST API and various language bindings. Graph pattern matching, Cypher query language, indexer: Lucene.
  • 15. All Paths & My Network size
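  • 19. Mining your network. Centrality algorithms. Degree: who has the most followers on Twitter. Closeness: who can reach everyone else in the fewest hops. Betweenness: who sits on the most shortest paths between other people. Eigenvector: who has more influential people following them (PageRank is a variant).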
  • 21. Flexible document oriented data. Document-style unstructured data, schema-less, e.g. JSON documents. No ALTER TABLE needed as in an RDBMS; de-normalized data. Useful for iterative/agile development. Humongous scale: billions of documents, R/W traffic in millions/sec, horizontal scalability, availability. mongoDB is a leading document database.
  • 22. Document Database use cases. Archiving historic data that has undergone many schema changes. A flexible set of performance metrics (web site page views, unique visitors, etc.) that changes over time, with no need to update existing JSON documents. Tracking near-real-time metrics: optimized increment of perf counters. Geolocation-based mobile and gaming apps (geospatial indices can be key here).
  • 23. Craigslist Archival Database. Premium service allowing customers to search their historical postings. Archival (no purging) of 10 years of postings: billions of documents. Schema changes across versions. On the MySQL-based archival database, ALTER TABLE took a month to complete.
  • 26. mongoDB Features. JSON documents, collection-oriented storage. Rich, document-based queries. Indexes on document attributes. Fast in-place updates. Scalability features: horizontal scalability, configurable replication and high availability, auto-sharding & rebalancing. Language-specific drivers: Java, Scala, Ruby, etc.
  • 28. Column Store. Reasonably rich data model: a sparse, distributed, persistent, multi-dimensional sorted map. Sorted row keys, columns. Use cases: large scale data storage and analysis, such as time series data along with associated dimension data; row keys are timestamps and thus sorted, which helps time range queries. Google Analytics: provides aggregate statistics such as # unique visitors/day and page views/URL/day; the raw click table (~200 TB) has a row for each URL + user session time, which keeps rows for the same URL contiguous and chronologically sorted. Data cube: CPU, OS, Time, DC.
  • 29. Column Store Performance. Excellent R/W performance; large storage (PBs). High scalability: horizontal scaling, auto-sharding. High availability: transparent replication of data. HBase is a leading column store, built on Hadoop with HDFS as the underlying persistence.
  • 30. Column Store: HBase. A table defines column families, which group similar attributes (vertical partitioning). A (Table, Row, ColumnFamily:Column, Timestamp) tuple maps to a cell value. A table is split into multiple, equally distributed regions, each of which is a range of sorted keys (partitioned automatically by the key). Rows are ordered by key; columns are ordered within a column family. Rows can have different numbers of columns. Columns have a value and versions (any number). Row range, column range and key range queries.
  • 33. Key Value Stores. Simplest possible data model. Use cases: caching a user’s personalized, rendered page to avoid the DB; S3 bucket storage for blob data against a unique id. Range of KV stores. Distributed, scalable, persistent key-value storage (Dynamo, Voldemort): auto-partitioned key space, replicated, highly available. Largely in-memory KV stores (Redis, memcached): Redis is blazing fast for caching and other interesting operations.
  • 34. Redis. In-memory KV store. Blazing fast: 100 K/sec R/W. Async snapshot to disk. More than a KV store, it is a data structure store: supports lists, queues, sets and operations on them. Sorted list range operations. Set operations: UNION, INTERSECTION, DIFF.
  • 35. Redis use cases. Web session caching with EXPIRE set for session expiry. Live, real-time bit.ly URL stats such as clicks: fast increments of counters. Auto-complete: typing the first few characters maps to a sorted list, and a range query is fired. Publish/subscribe: fan out a message to subscribers. Set operations: my Twitter <Followees DIFF Followers> tells me whom I follow but who don’t follow me back (<Followers INTERSECTION Followees> gives the mutual follows).
  • 36. Thanks. Email: abhijit.sharma@gmail.com; Twitter: sharmaabhijit; Blog: abhijitsharma.blogspot.com
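
The graph slides name shortest path and network size as typical traversal problems. A minimal sketch of both, using breadth-first search over an in-memory adjacency list, might look like the following. The `FOLLOWS` graph and the function names are hypothetical and for illustration only; this is plain Python, not the Neo4J API, and a graph database would run such traversals against its own on-disk store.

```python
from collections import deque

# Hypothetical follower graph (assumption): each key follows the users in its list.
FOLLOWS = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave"],
    "dave": ["erin"],
    "erin": [],
}


def shortest_path(graph, start, goal):
    """Return one shortest path from start to goal as a list, or None."""
    parents = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in parents:
                parents[neighbour] = node
                queue.append(neighbour)
    return None


def network_size(graph, start, max_depth):
    """Count distinct nodes reachable from start within max_depth hops."""
    seen = {start}
    frontier = [start]
    for _ in range(max_depth):
        next_frontier = []
        for node in frontier:
            for neighbour in graph.get(node, []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return len(seen) - 1  # exclude the start node itself


print(shortest_path(FOLLOWS, "alice", "erin"))
print(network_size(FOLLOWS, "alice", 2))
```

Each hop here is a dictionary lookup on the current node. Expressing the same traversal in SQL typically needs one self-join per hop, which is the cost the "Why not RDBMS" slide points at.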
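
The Redis slides list UNION, INTERSECTION and DIFF on sets, with Twitter followers as the use case. The semantics can be sketched with plain Python sets, as below. The sample names and helper functions are hypothetical; in Redis itself the corresponding commands are SUNION, SINTER and SDIFF, which this sketch does not call.

```python
# Hypothetical sample data (assumption): who follows me, and whom I follow.
followers = {"bob", "carol", "dave"}
followees = {"carol", "dave", "erin"}


def mutual_follows(followers, followees):
    """INTERSECTION: people I follow who also follow me."""
    return followers & followees


def not_following_back(followers, followees):
    """DIFF (followees minus followers): people I follow who do not follow me."""
    return followees - followers


def not_followed_by_me(followers, followees):
    """DIFF (followers minus followees): people who follow me but whom I do not follow."""
    return followers - followees


def everyone(followers, followees):
    """UNION: everyone connected to me in either direction."""
    return followers | followees


print(sorted(mutual_follows(followers, followees)))
print(sorted(not_following_back(followers, followees)))
```

Note that DIFF is order-sensitive: swapping its two arguments answers the opposite question, whereas INTERSECTION and UNION give the same result either way.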