Big Data- GITPRO 2013
By - Sameer Wadkar
Co-Founder & Big Data Architect / Data Scientist at Axiomine
Agenda
• What is Big Data
• Big Data Characteristics
• Big Data and Business Intelligence Applications
• Big Data and Transactional Applications
• Demo
What is Big Data?
Volume
Velocity
Big
Data
Variety
Big Data monitors 12 Terabytes
of Tweets each day to improve
product sentiment analysis
(source :IBM)
Amazon and PayPal
use Big-Data for real
time fraud detection
(source: McKinsey)
In 15 of the US economy’s 17
sectors, companies with upward of
1,000 employees store, on average,
more information than the Library
of Congress (source: McKinsey)
Most Big Data applications are based around the Volume dimension
Visualizing Big Data
• 1 Petabyte is 54000 movies in digital format
• Reading 1 Terabyte of data sequentially from a single disk drive
takes 3 hours
• Typical speed to read from the hard-disk – 80 MB/sec
• Traversing 1 Terabyte of data randomly over 1 disk (a typical
database access scenario) takes orders of magnitude longer
• Disk transfer rates are significantly higher than disk seek rates
Single node processing capacity will drown in the face of Big Data
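The 3-hour figure above is simple arithmetic. A minimal sketch, using only the numbers stated on the slide (1 TB of data, 80 MB/sec transfer rate):

```python
# Back-of-the-envelope check of the sequential-read claim above.
# Figures come from the slide: 1 TB of data, 80 MB/sec disk transfer rate.
TB = 10**12               # bytes
MB = 10**6                # bytes
transfer_rate = 80 * MB   # bytes/sec, typical 2013-era hard disk

seconds = TB / transfer_rate
hours = seconds / 3600
print(f"Sequential read of 1 TB: {hours:.1f} hours")  # ~3.5 hours
```

Random access is far worse: each seek costs milliseconds while transferring the same bytes costs microseconds, which is why single-node, seek-heavy database access drowns at this scale.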
Big Data vs. Traditional
Big Data Architecture
…
In Big Data architectures the application moves to the data. Why?
User launches a batch job
1
Three Tier Architecture
App Request Data from Data Tier
2
Data Tier sends data to the App Tier
3
App Tier
processes data
4
App Tier sends the report
5
User requests a report
1
Master Distributes Application
2
Master launches App on nodes
3
User downloads results
4
All Nodes
process the
data on their
nodes
Master Node
Application & Data Tier
Data Tier
Application Tier
Why is Big Data hard?
Divide the data and conquer in place is the Big Data strategy
• Goal is to divide the data across multiple nodes and conquer by
processing the data in place on each node.
• Real-world processing cannot always be divided into smaller sub-
problems (Divide and Conquer is not always feasible)
• Data has dependencies
• Normalization v/s Denormalization
• There are processing dependencies. Later phase of the process may
require results of an earlier phase
• Single Pass v/s Multi-pass
Big Data Characteristics
Scale-out, Fault Tolerance & Graceful Recovery are essential features
• Big Data Systems must scale out
• Adding more nodes should lead to greater parallelization
• Big Data Systems must be resilient to partial failure
• If one part of the system fails other parts should continue to
function
• Big Data Systems must be able to self-recover from partial failure
• If any part of the system fails another part of the system will
attempt to recover from the failure.
• Data must be replicated on separate nodes
• Loss of any node does not lose data or processing.
• Recovery should be transparent to the end-user.
Big Data Applications
Big Data design is dictated by the nature of the applications
• Business Intelligence applications
• Read-only systems
• ETL Systems
• Query massive data for purpose of generating reports or for large
scale transformations and import into destination data-source
• Transactional Applications
• One part of the system updates data while another part reads the
data
• Example Systems – Imagine running an online store at Amazon.com
scale.
BI - Sample Use-Case
A very simple query but size makes all the difference
• “SELECT YEAR, SUM(SALES_AMT) FROM SALES WHERE STATE=‘MD’ GROUP BY
YEAR ORDER BY YEAR”
• Find me total sales revenue by year for “Maryland” and order them
by year
• What if SALES table has billions of rows over 20 years?
Sales Transactions
Table
Big Data
Reporting
Year Sales Revenue
1980 11 Million
1981 13 Million
… …
2010 10 Billion
Input
Output
BI Big Data Flavors
We discuss three flavors in increasing order of scale-out capability
Big Data Flavor Products
In-Memory Databases Oracle Exalytics, SAP HANA
Massively Parallel Computing (MPP) Greenplum, Netezza
Map Reduce Hadoop
In Memory Databases
If state=‘VA’ is the next query and the cache is only big enough to hold one state
Simplified version - Data is partitioned randomly across all nodes.
Selection Phase
1. Each data node contains fast Memory (SSD) and
mechanism to apply “Where” clause
2. Only the necessary data (“MD” records) are
passed over the expensive Network I/O to the
processing node
Processing Phase
1. The processing nodes will compute the
SUM(sales_amt) by year
2. Order the results
3. Place it In-Memory cache
• First execution of the query is slow.
• Subsequent executions are very fast (almost real-time) as the cache is
hot.
• Cache has SQL-Interface. User experiences “Real-Time”!!
Data Node Data Node …… Data Node
Processing Node
In-Memory TB Cache
with SQL Interface
User SQL
Interface
Fetch Phase
The user is served the results from the cache through
the familiar SQL Interface
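The three phases above (selection, processing, fetch) can be sketched as a toy simulation. The dataset, node layout, and `query` function are illustrative assumptions, not any vendor's API; the point is that the first query pays for a full scan while repeat queries hit the hot cache:

```python
from collections import defaultdict

# Hypothetical toy dataset: (state, year, sales_amt) rows spread across "data nodes".
data_nodes = [
    [("MD", 1980, 5), ("VA", 1980, 7), ("MD", 1981, 9)],
    [("MD", 1980, 6), ("MD", 1981, 4), ("VA", 1981, 2)],
]

cache = {}  # in-memory result cache keyed by the query's WHERE clause

def query(state):
    if state in cache:        # fetch phase: hot cache, "real-time" response
        return cache[state]
    # Selection phase: each data node applies the WHERE clause locally,
    # so only matching rows cross the expensive network to the processing node.
    selected = [row for node in data_nodes for row in node if row[0] == state]
    # Processing phase: aggregate by year, order, and populate the cache.
    totals = defaultdict(int)
    for _, year, amt in selected:
        totals[year] += amt
    cache[state] = sorted(totals.items())
    return cache[state]

print(query("MD"))  # first run scans everything: [(1980, 11), (1981, 13)]
print(query("MD"))  # second run is served from the hot cache
```

A query for a different state ("VA") would evict or compete for the cache, which is exactly the weakness noted above when the cache only fits one state.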
In Memory Databases (cont.)
In-Memory DBs provide real-time querying on moderate sized data
• Specialized hardware
• Specialized I/O and Flash Memory for faster I/O
• Massive in-memory cache (multi-terabyte) with SQL Interface
Characteristics
• Familiar model (SQL Interface)
• Can integrate with standard toolkits and BI Solutions
• Unified software/hardware solution
Pros
• Vendor lock-in
• Expensive – Hardware as well as licensing cost
• Typically cannot scale beyond 1-2 TB of data
• Works best when same data is read often (Cache remains hot).
Cons
MPP (Typical Architecture)
Data is partitioned horizontally across all slave nodes. Assume “Sale Year” is the distribution
key. Secondary indexes by other keys can be added to each slave node.
Distributed Query Phase
1. Each slave node will compute the query
for the data contained in its own node.
2. Each year’s data is held entirely on its
own node
3. This phase produces partial query results
which are complete for each year
Slave Node
(1980 & 1990 data)
Slave Node
(1981 & 1991 data) .. Slave Node
(2000 & 2010 data)
Master Node
Accumulation Phase
1. All slave results are aggregated and sorted.
• Scale Out – More nodes means fewer years of data per node.
• Redundancy & Failover – Each node will have a backup node.
• Data distribution strategy & access patterns compatibility
determine performance.
• Enormous network overhead if access-patterns do not respect
distribution strategy
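The two phases above can be sketched as a toy simulation. The node layout and row format are illustrative assumptions; the key property is that because "Sale Year" is the distribution key, each slave's partial result is already final for its years and the master only merges and sorts:

```python
from collections import defaultdict

# Hypothetical rows: (year, sales_amt), horizontally partitioned by year
# so each slave node holds complete data for its assigned years.
slave_nodes = {
    "node-a": [(1980, 5), (1980, 6), (1990, 3)],
    "node-b": [(1981, 9), (1991, 2)],
}

def distributed_query():
    # Distributed query phase: each slave aggregates its own years locally.
    # A year never spans nodes, so these partial results are final per year.
    partials = []
    for rows in slave_nodes.values():
        totals = defaultdict(int)
        for year, amt in rows:
            totals[year] += amt
        partials.append(dict(totals))
    # Accumulation phase: the master merges and sorts the slave results.
    merged = {}
    for partial in partials:
        merged.update(partial)  # keys are disjoint across nodes by construction
    return sorted(merged.items())

print(distributed_query())  # [(1980, 11), (1981, 9), (1990, 3), (1991, 2)]
```

If a query instead grouped by a column that is not the distribution key (say, state), rows for one group would be scattered across nodes and the merge step would require heavy data movement over the network, which is the performance cliff the slide warns about.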
MPP (cont.)
MPP supports familiar RDBMS paradigm for medium scalability
• Balances throughput with responsiveness.
• Some implementations use specialized hardware (Ex. Netezza uses FPGA)
• Familiar RDBMS (SQL) paradigm
• Can scale to 10’s of Terabytes in most cases
Characteristics
• Familiar model (SQL Interface)
• Can integrate with standard toolkits and BI Solutions
Pros
• Vendor lock-in
• Cannot scale for ad-hoc queries
• Queries must respect data distribution strategy for acceptable performance.
Cons
MapReduce
Data is partitioned randomly/redundantly across all data nodes. Every data node may contain sales
data for any state and any year.
Map Phase
1. Each data node reads all of its
records sequentially.
2. It filters out all non- “MD” state
records
3. It computes a SUM(sales_amt) by
year for each year
Data Node Data Node … Data Node
Reduce Node
Reduce Phase
1. Reduce node receives
SUM(sales_amt) for state “MD” by
each year from each node
2. Add all map results by year and
compute the final SUM(sales_amt)
by year for “MD” sales
3. Orders results by year
• Data blocks (order of 128 MB) are stored and accessed contiguously
• Scales out efficiently and degrades gracefully.
• If a task fails the framework restarts it automatically (on another node
if necessary) – Redundancy and Graceful Recovery
Master Node
Map Nodes
MapReduce (cont.)
Map Reduce – How it works
Year Sales
1990 $1M
1982 $2M
… ..
1999 $20M
Map Process 1
Year Sales
1998 $6M
1982 $5M
… ..
2010 $30M
Map Process 20
……
Reduce Node adds up all the map
results, sorts by year to give final
result
Year Sales
1980 $100M
1981 $102M
… ..
2010 $250M
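The map/reduce flow above can be sketched in a few lines. This is a toy in-process version (real Hadoop jobs ship these functions to the nodes holding the data blocks); the dataset and function names are illustrative assumptions:

```python
from collections import defaultdict

# Toy rows: (state, year, sales_amt), randomly spread across data nodes.
data_nodes = [
    [("MD", 1980, 5), ("VA", 1980, 7), ("MD", 1981, 9)],
    [("MD", 1980, 6), ("VA", 1981, 2), ("MD", 1981, 4)],
]

def map_phase(rows, state="MD"):
    # Each node scans its block sequentially, filters out non-matching
    # states, and emits a partial SUM(sales_amt) per year.
    totals = defaultdict(int)
    for st, year, amt in rows:
        if st == state:
            totals[year] += amt
    return dict(totals)

def reduce_phase(partials):
    # The reduce node adds up the per-node partial sums and orders by year.
    final = defaultdict(int)
    for partial in partials:
        for year, amt in partial.items():
            final[year] += amt
    return sorted(final.items())

partials = [map_phase(rows) for rows in data_nodes]  # runs in parallel on the cluster
print(reduce_phase(partials))  # [(1980, 11), (1981, 13)]
```

Note that the mapper does partial aggregation locally (Hadoop calls this a combiner), so only small per-year sums, not raw records, travel over the network to the reducer.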
MapReduce (cont.)
MapReduce is general purpose but requires complex skills.
• Batch oriented - Maximizes throughput not responsiveness
Characteristics
• Simple programming model
• Scales out efficiently
• Failure and redundancy built in
• Adapts well to a wide variety of problems
Pros
• Requires custom programming
• Higher level languages (SQL-like) exist but programming skills are often
critical
• Requires a complex array of skills to manage & maintain a MapReduce
System
Cons
Summary of BI Apps
Each option has tradeoffs. Choose based on requirements
Big Data Flavor How much data can it typically handle?
In Memory
Databases
Order of 1TB
Massively Parallel
Databases
Order of 10 TB
MapReduce Order of 100’s of TB into the Petabyte
range
Transactional System - Use-Case
How many items in stock do users A and B see on their second access?
Web Based Online
Store Database
User A Looks up
item X
User B Looks up
item X
User C buys item X
Updates inventory
User A Looks up
item X again
User B Looks up
item X again
Context – CAP Theorem
You can get any two but not all three features in any system
Characteristic
Consistency All nodes (and users) see the same data
at the same time.
Availability A guarantee that every request receives
a valid response. Site does not go down
or appear down under heavy load.
Partition Tolerance The system continues to function
regardless of loss of one of its
components
CA – Single RDBMS
A single RDBMS instance is both consistent and available
Web Based Online
Store RDBMS
User A Looks up
item X
User B Looks up
item X
User C buys item X
Updates inventory
User A Looks up
item X again
User B Looks up
item X again
• When set up in “Read Committed” mode every user sees the same
inventory count
• System responds with last committed inventory count even during
updates
• Consistent
• Available
CP – Distributed RDBMS
A Distributed RDBMS is consistent and resilient to failure of nodes
Web Based Online
Store
East Region
RDBMS
User A Looks up
item X
User B Looks up
item X
User C buys item X
Updates inventory
User A Looks up
item X again
User B Looks up
item X again
• Under “Read Committed” mode all users see consistent counts.
• If one DB fails the other one will serve all users (Partition Tolerance)
• During two phase commit system is unavailable.
• Consistent
• Partition Tolerant
West Region
RDBMS
2- Phase
Commit
AP – Distributed NoSQL Database
Eventual Consistency is the key to Big Data Transactional Systems
Web Based Online Store
User A Looks up
item X
User B Looks up
item X
User C buys item X
Updates inventory
User A Looks up
item X again
User B Looks up
item X again
• Amazon Dynamo and Apache Cassandra work on this principle
• If one DB fails the other one will serve all users (Partition Tolerance)
• Users will always be able to browse all products but occasionally
some users will see a stale count of inventory (Eventual Consistency)
• Available
• Partition Tolerant
• Eventually Consistent
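The stale-read behavior above can be sketched as a toy two-replica simulation. The replica names and the `replicate` anti-entropy step are illustrative assumptions, not Dynamo's or Cassandra's actual protocol; the point is that writes never block reads, at the cost of briefly serving stale counts:

```python
# Two replicas of the inventory count for item X. A write lands on one
# replica and propagates asynchronously, so the other briefly serves stale data.
replicas = {"east": {"X": 10}, "west": {"X": 10}}
pending = []  # asynchronous replication queue

def buy(region, item):
    replicas[region][item] -= 1
    pending.append((region, item))  # replicate later; don't block the user

def lookup(region, item):
    return replicas[region][item]

def replicate():
    # Anti-entropy pass: push the updated value to the other replica.
    while pending:
        region, item = pending.pop()
        other = "west" if region == "east" else "east"
        replicas[other][item] = replicas[region][item]

buy("east", "X")
print(lookup("east", "X"))  # 9  -- user A sees the new count
print(lookup("west", "X"))  # 10 -- user B sees a stale count, but the site stays up
replicate()
print(lookup("west", "X"))  # 9  -- replicas converge: eventual consistency
```

Both lookups succeed even mid-update (Availability), and losing one replica leaves the other serving all users (Partition Tolerance); only strict Consistency is relaxed.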
Hybrid Solution
Big Data Techniques – Not an either or choice!
Large
Structured DB
Large
Unstructured
DB
Map Reduce
based ETL MPP DB
In-Memory
DB
Business Users can
use familiar SQL
based tools in
real-time. In-
Memory DB
allows that
No-SQL DB
Programmers, System
Admins with no real-time
requirements can use all
three techniques. NoSQL
DB’s allow technical users
to gain real-time benefits
in ways which suite their
complex needs.
Familiar BI
Solution
Programs &
Scripts
100 TB to
1 PB
5-10 TB
1 TB
Few 100
GB
Exploring Millions of US Patent Pages at the Speed of Thought
www.axiomine.com/patents/
Demo- US Patent Explorer
Patent Explorer Goals
Seamlessly navigate Structured and Unstructured data in real-time
• Navigate 3 million US patents’ data (text and metadata) from 1963 to
1999 at the speed of thought.
• Data Sources
• Patent Metadata - National Bureau of Economic Research
• Patent Text – Bulk Download from Google Site
• Each week granted patents are published to the Google Site as an
archive.
• Size of uncompressed data
• Structured Metadata – Approximately 2 GB
• Patent Text Data – Approximately 300 GB
Patent Metadata
Cannot answer – What is the title of Patent No 8086905?
Source – National Bureau of Economic Research
http://data.nber.org/patents/
Patent Master
Pairwise
Citations
*
Inventors
*
Patent Master Other Master Data
Company
Master
Country
Master
Classification
Master
Contains only meta-data. No text data such as Patent Title available.
Ex. Pairwise citations contains millions of patent id pairs
Patent Text
Need to merge both metadata & text
Source – Google
http://www.google.com/googlebooks/uspto.html
Sample File
High Level Architecture
Need to merge both metadata & text
Hadoop
Patent
Metadata
Patent Text
Navigation, Search
& Text Analytics
Apache Solr
Patent Details
MongoDB
Text Enhanced
Citation Data
Raw Data Tier ETL & Text Analytics Tier Search & Visualization
Navigate, Search
& Visualize
Drill down to
Patent Details
Big Data Flavors – Summary
Choose a Big Data tool and product based on requirements
Flavor Characteristics
Map-Reduce • Massive 100 TB to 1 PB Scale ETL
• Complex Analytics on Massive Data
• Large Scale Unstructured Data Analysis
Massively Parallel
Processing (MPP)
• Batch oriented aggregations
• Analytics on Moderately Large Structured Data with
predictable access patterns
In-Memory DB • Similar to MPP but with real-time access patterns required.
• Rich and Interactive Business Intelligence Apps
NoSQL databases • Similar to In-Memory DB but simpler (Non SQL) access
patterns
• Provide fast access to detail data where other techniques are
used to serve summary data
GPGPU • Real time Value At Risk (Financial Risk Management)
• Compute-intensive analytics, e.g. simulation of a Hospital
Waiting Room over 1 year
 
How to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For FlutterHow to use Firebase Data Connect For Flutter
How to use Firebase Data Connect For Flutter
 
Presentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of GermanyPresentation of the OECD Artificial Intelligence Review of Germany
Presentation of the OECD Artificial Intelligence Review of Germany
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
How to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptxHow to Get CNIC Information System with Paksim Ga.pptx
How to Get CNIC Information System with Paksim Ga.pptx
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with SlackLet's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 

Big Data presentation at GITPRO 2013

  • 1. Big Data – GITPRO 2013
    By Sameer Wadkar, Co-Founder & Big Data Architect / Data Scientist at Axiomine
  • 2. Agenda
    • What is Big Data
    • Big Data Characteristics
    • Big Data and Business Intelligence Applications
    • Big Data and Transactional Applications
    • Demo
  • 3. What is Big Data? Volume, Velocity, Variety
    • Big Data monitors 12 Terabytes of Tweets each day to improve product sentiment analysis (source: IBM)
    • Amazon and PayPal use Big Data for real-time fraud detection (source: McKinsey)
    • In 15 of the US economy’s 17 sectors, companies with upward of 1,000 employees store, on average, more information than the Library of Congress (source: McKinsey)
    Most Big Data applications are based around the Volume dimension.
  • 4. Visualizing Big Data
    • 1 Petabyte is about 54,000 movies in digital format
    • Typical sequential read speed from a hard disk is about 80 MB/sec
    • At that rate, reading 1 Terabyte of data sequentially from a single disk drive takes roughly 3.5 hours
    • Traversing 1 Terabyte of data randomly over one disk (a typical database access scenario) takes orders of magnitude longer, because sequential transfer rates are far higher than random-seek throughput
    A single node’s processing capacity will drown in the face of Big Data.
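The sequential-read figure above is easy to sanity-check. A minimal back-of-the-envelope calculation, assuming the slide's stated 80 MB/sec sustained transfer rate:

```python
# Back-of-the-envelope check of the sequential-read claim.
# Assumption (from the slide): a sustained transfer rate of 80 MB/sec.
TERABYTE_MB = 1_000_000          # 1 TB expressed in MB (decimal units)
transfer_rate_mb_per_sec = 80    # typical sustained disk transfer rate

seconds = TERABYTE_MB / transfer_rate_mb_per_sec
hours = seconds / 3600
print(f"Sequential read of 1 TB: {hours:.1f} hours")  # roughly 3.5 hours
```

Random access is worse by orders of magnitude because each seek costs milliseconds while delivering only a few kilobytes.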
  • 5. Big Data vs. Traditional Architecture
    Three-Tier Architecture (Application Tier and Data Tier separate):
    1. User requests a report
    2. App Tier requests data from the Data Tier
    3. Data Tier sends data to the App Tier
    4. App Tier processes the data
    5. App Tier sends the report
    Big Data Architecture (Application & Data Tier combined; Master Node coordinates):
    1. User launches a batch job
    2. Master distributes the application to the nodes
    3. Master launches the application on each node
    4. All nodes process the data held locally on their own node
    5. User downloads the results
    In Big Data architectures the application moves to the data. Why?
  • 6. Why is Big Data hard?
    Divide and conquer in place is the Big Data strategy.
    • The goal is to divide the data across multiple nodes and conquer by processing the data in place on each node.
    • Real-world processing cannot always be divided into smaller sub-problems (divide and conquer is not always feasible)
    • Data has dependencies
      • Normalization vs. denormalization
    • There are processing dependencies: a later phase of the process may require results of an earlier phase
      • Single-pass vs. multi-pass
  • 7. Big Data Characteristics
    Scale-out, fault tolerance, and graceful recovery are essential features.
    • Big Data systems must scale out
      • Adding more nodes should lead to greater parallelization
    • Big Data systems must be resilient to partial failure
      • If one part of the system fails, the other parts should continue to function
    • Big Data systems must be able to self-recover from partial failure
      • If any part of the system fails, another part of the system will attempt to recover from the failure
      • Data must be replicated on separate nodes, so loss of any node loses neither data nor processing
      • Recovery should be transparent to the end user
  • 8. Big Data Applications
    Big Data design is dictated by the nature of the applications.
    • Business Intelligence applications
      • Read-only systems
      • ETL systems: query massive data to generate reports, or for large-scale transformations and import into a destination data source
    • Transactional applications
      • One part of the system updates data while another part reads the data
      • Example: imagine running an online store at Amazon.com scale
  • 9. BI – Sample Use-Case
    A very simple query, but size makes all the difference.
    • SELECT SUM(SALES_AMT) FROM SALES WHERE STATE = 'MD' GROUP BY YEAR ORDER BY YEAR
    • In other words: find the total sales revenue by year for Maryland, ordered by year
    • What if the SALES table has billions of rows spanning 20 years?
    Input: Sales Transactions Table → Big Data Reporting → Output:
    Year   Sales Revenue
    1980   11 Million
    1981   13 Million
    …      …
    2010   10 Billion
  • 10. BI Big Data Flavors
    We discuss three flavors in increasing order of scale-out capability.
    Big Data Flavor                        Products
    In-Memory Databases                    Oracle Exalytics, SAP HANA
    Massively Parallel Processing (MPP)    Greenplum, Netezza
    MapReduce                              Hadoop
  • 11. In-Memory Databases
    Simplified version – data is partitioned randomly across all data nodes, which feed a processing node backed by a multi-terabyte in-memory cache with a SQL interface.
    Selection Phase:
    1. Each data node contains fast memory (SSD) and a mechanism to apply the WHERE clause
    2. Only the necessary data (the “MD” records) is passed over the expensive network I/O to the processing node
    Processing Phase:
    1. The processing node computes SUM(sales_amt) by year
    2. It orders the results
    3. It places them in the in-memory cache
    Fetch Phase: the user is served the results from the cache through the familiar SQL interface.
    • The first execution of the query is slow
    • Subsequent executions are very fast (almost real-time) because the cache is hot
    • The cache has a SQL interface, so the user experiences “real time”
    But what if STATE = 'VA' is the next query and the cache is only big enough to hold one state?
  • 12. In-Memory Databases (cont.)
    In-memory DBs provide real-time querying on moderately sized data.
    Characteristics:
    • Specialized hardware: specialized I/O and flash memory for faster I/O
    • Massive multi-terabyte in-memory cache with a SQL interface
    Pros:
    • Familiar model (SQL interface)
    • Can integrate with standard toolkits and BI solutions
    • Unified software/hardware solution
    Cons:
    • Vendor lock-in
    • Expensive – hardware as well as licensing cost
    • Typically cannot scale beyond 1–2 TB of data
    • Works best when the same data is read often (the cache stays hot)
  • 13. MPP (Typical Architecture)
    Data is partitioned horizontally across all slave nodes. Assume “sale year” is the distribution key; secondary indexes by other keys can be added to each slave node.
    Distributed Query Phase:
    1. Each slave node computes the query for the data contained in its own node
    2. Each year’s data is held completely on a single node
    3. This phase produces partial query results which are complete for each year
    Accumulation Phase:
    1. The master node aggregates and sorts all slave results
    • Scale out – more nodes means fewer years of data per node
    • Redundancy & failover – each node has a backup node
    • Compatibility between the data distribution strategy and the access patterns determines performance
    • Enormous network overhead if access patterns do not respect the distribution strategy
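The two MPP phases above can be sketched in a few lines of plain Python. This is a minimal illustration, not any vendor's API; the node data, `distributed_query`, and `accumulate` names are all made up for the example. Because "sale year" is the distribution key, each year lives entirely on one node, so the master only merges and sorts.

```python
# Each "slave node" holds complete data for its assigned years.
# Rows are (state, year, sales_amt) tuples.
slave_nodes = [
    [("MD", 1980, 5), ("VA", 1980, 7), ("MD", 1990, 9)],
    [("MD", 1981, 4), ("MD", 1991, 6)],
    [("MD", 2010, 11), ("VA", 2010, 2)],
]

def distributed_query(node_rows, state):
    """Phase 1: each slave node answers the query for its own years only."""
    partial = {}
    for st, year, amt in node_rows:
        if st == state:
            partial[year] = partial.get(year, 0) + amt
    return partial

def accumulate(partials):
    """Phase 2: the master merges and sorts the per-node partial results."""
    merged = {}
    for partial in partials:
        merged.update(partial)  # years never overlap across nodes
    return sorted(merged.items())

result = accumulate(distributed_query(rows, "MD") for rows in slave_nodes)
print(result)  # [(1980, 5), (1981, 4), (1990, 9), (1991, 6), (2010, 11)]
```

Note that the master does no arithmetic here: the distribution key guarantees each partial result is already final for its years. That guarantee disappears when queries do not respect the distribution key, which is exactly where the network-overhead warning above comes from.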
  • 14. MPP (cont.)
    MPP supports the familiar RDBMS paradigm at medium scalability.
    Characteristics:
    • Balances throughput with responsiveness
    • Some implementations use specialized hardware (e.g., Netezza uses FPGAs)
    • Familiar RDBMS (SQL) paradigm
    • Can scale to tens of terabytes in most cases
    Pros:
    • Familiar model (SQL interface)
    • Can integrate with standard toolkits and BI solutions
    Cons:
    • Vendor lock-in
    • Cannot scale for ad-hoc queries
    • Queries must respect the data distribution strategy for acceptable performance
  • 15. MapReduce
    Data is partitioned randomly and redundantly across all data nodes; every data node contains sales data for every state and every year. A master node coordinates the map nodes and the reduce node.
    Map Phase:
    1. Each data node reads all of its records sequentially
    2. It filters out all non-“MD” records
    3. It computes a partial SUM(sales_amt) for each year
    Reduce Phase:
    1. The reduce node receives a SUM(sales_amt) for state “MD” by year from each map node
    2. It adds the map results by year to compute the final SUM(sales_amt) by year for “MD” sales
    3. It orders the results by year
    • Data blocks (on the order of 128 MB) are stored and accessed contiguously
    • Scales out efficiently and degrades gracefully
    • If a task fails, the framework restarts it automatically (on another node if necessary) – redundancy and graceful recovery
  • 16. MapReduce (cont.)
    MapReduce – how it works. Each map process emits a partial per-year sum, e.g.:
    Map Process 1          Map Process 20
    Year   Sales           Year   Sales
    1990   $1M             1998   $6M
    1982   $2M             1982   $5M
    …      …               …      …
    1999   $20M            2010   $30M
    The reduce node adds up all the map results and sorts by year to give the final result:
    Year   Sales
    1980   $100M
    1981   $102M
    …      …
    2010   $250M
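The map and reduce phases drawn above can be sketched as pure Python, with no Hadoop dependency. This is an illustrative model only (the `data_nodes`, `map_phase`, and `reduce_phase` names are invented for the sketch). Because data is partitioned randomly, the same year can appear on many nodes, so the reduce step must add partial sums rather than just merge them, unlike the MPP case.

```python
# Rows are (state, year, sales_amt); partitioning is random, so years repeat
# across nodes.
data_nodes = [
    [("MD", 1980, 3), ("VA", 1980, 8), ("MD", 1981, 2)],
    [("MD", 1980, 4), ("MD", 2010, 10)],
    [("VA", 2010, 1), ("MD", 1981, 5)],
]

def map_phase(node_rows, state):
    """Each map node filters to the requested state and pre-sums by year."""
    partial = {}
    for st, year, amt in node_rows:
        if st == state:
            partial[year] = partial.get(year, 0) + amt
    return partial

def reduce_phase(partials):
    """The reduce node ADDS the per-year partials from every map node."""
    totals = {}
    for partial in partials:
        for year, amt in partial.items():
            totals[year] = totals.get(year, 0) + amt
    return sorted(totals.items())

result = reduce_phase(map_phase(rows, "MD") for rows in data_nodes)
print(result)  # [(1980, 7), (1981, 7), (2010, 10)]
```

Pre-summing inside the map phase (what Hadoop calls a combiner) is what keeps the data shipped to the reduce node small: each node sends at most one number per year, no matter how many raw records it scanned.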
  • 17. MapReduce (cont.)
    MapReduce is general purpose but requires complex skills.
    Characteristics:
    • Batch oriented – maximizes throughput, not responsiveness
    Pros:
    • Simple programming model
    • Scales out efficiently
    • Failure handling and redundancy built in
    • Adapts well to a wide variety of problems
    Cons:
    • Requires custom programming
    • Higher-level (SQL-like) languages exist, but programming skills are often critical
    • Requires a complex array of skills to manage and maintain a MapReduce system
  • 18. Summary of BI Apps
    Each option has tradeoffs; choose based on requirements.
    Big Data Flavor                   Typical data volume
    In-Memory Databases               Order of 1 TB
    Massively Parallel Databases      Order of 10 TB
    MapReduce                         Order of 100s of TB into the Petabyte range
  • 19. Transactional System – Use-Case
    How many items in stock do users A and B see on their second access?
    A web-based online store backed by a database:
    1. User A looks up item X
    2. User B looks up item X
    3. User C buys item X, which updates the inventory
    4. User A looks up item X again
    5. User B looks up item X again
  • 20. Context – CAP Theorem
    You can get any two, but not all three, of these properties in a distributed system:
    • Consistency – all nodes (and users) see the same data at the same time
    • Availability – a guarantee that every request receives a valid response; the site does not go down or appear down under heavy load
    • Partition Tolerance – the system continues to function regardless of the loss of one of its components
  • 21. CA – Single RDBMS
    A single RDBMS instance is both consistent and available.
    • When set up in “Read Committed” mode, every user sees the same inventory count
    • The system responds with the last committed inventory count even during updates
    • Consistent and Available
  • 22. CP – Distributed RDBMS
    A distributed RDBMS is consistent and resilient to failure of nodes. The East Region RDBMS and West Region RDBMS are kept in sync via two-phase commit.
    • Under “Read Committed” mode all users see consistent counts
    • If one DB fails, the other serves all users (Partition Tolerance)
    • During the two-phase commit the system is unavailable
    • Consistent and Partition Tolerant
  • 23. AP – Distributed RDBMS
    Eventual consistency is the key to Big Data transactional systems.
    • Amazon Dynamo and Apache Cassandra work on this principle
    • If one DB fails, the other serves all users (Partition Tolerance)
    • Users can always browse all products, but occasionally some users will see a stale inventory count (Eventual Consistency)
    • Available, Partition Tolerant, and Eventually Consistent
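The stale-read behavior described above can be made concrete with a toy simulation: two replicas, writes applied to one and replicated asynchronously, so a read against the other replica can briefly return an out-of-date inventory count. Everything here (`Replica`, `buy`, `drain_replication`) is invented for illustration and has nothing to do with Dynamo's or Cassandra's actual internals.

```python
class Replica:
    """A trivially simple replica holding one inventory count."""
    def __init__(self, count):
        self.count = count

east, west = Replica(10), Replica(10)
pending = []  # asynchronous replication queue

def buy(replica, other):
    """Apply the write locally; replicate to the peer later, off the write path."""
    replica.count -= 1
    pending.append(other)

def drain_replication():
    """Eventually, the queued updates reach the other replica."""
    while pending:
        pending.pop().count -= 1

buy(east, west)      # user C buys item X via the east replica
stale = west.count   # user A reads west BEFORE replication: stale value (10)
drain_replication()
fresh = west.count   # after replication both replicas agree (9)
```

The write returned immediately and the west replica kept serving reads throughout (Availability, Partition Tolerance), at the cost of the brief window where `stale` disagreed with `east.count` (Eventual Consistency).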
  • 24. Hybrid Solution
    Big Data techniques are not an either/or choice.
    • MapReduce-based ETL moves data from large structured and unstructured DBs (100 TB to 1 PB) into an MPP DB (5–10 TB), an In-Memory DB (1 TB), and a NoSQL DB (a few 100 GB)
    • Business users can use familiar SQL-based BI tools in real time; the In-Memory DB allows that
    • Programmers and system admins with no real-time requirements can use all three techniques through programs and scripts
    • NoSQL DBs allow technical users to gain real-time benefits in ways which suit their complex needs
  • 25. Demo – US Patent Explorer
    Exploring millions of US patent pages at the speed of thought: www.axiomine.com/patents/
  • 26. Patent Explorer Goals
    Seamlessly navigate structured and unstructured data in real time.
    • Navigate 3 million US patents (text and metadata) from 1963 to 1999 at the speed of thought
    • Data sources:
      • Patent metadata – National Bureau of Economic Research
      • Patent text – bulk download from the Google site; each week, granted patents are published to the Google site as an archive
    • Size of uncompressed data:
      • Structured metadata – approximately 2 GB
      • Patent text data – approximately 300 GB
  • 27. Patent Metadata
    Source – National Bureau of Economic Research, http://data.nber.org/patents/
    • Tables: Patent Master, Pairwise Citations, Inventors, plus other master data (Company Master, Country Master, Classification Master)
    • Contains only metadata; no text data such as the patent title is available, so it cannot answer “What is the title of Patent No. 8086905?”
    • For example, Pairwise Citations contains millions of patent-id pairs
  • 28. Patent Text
    Source – Google, http://www.google.com/googlebooks/uspto.html
    We need to merge both the metadata and the text. (The slide showed a sample file.)
  • 29. High Level Architecture
    We need to merge both metadata and text.
    • Raw Data Tier: patent metadata and patent text
    • ETL & Text Analytics Tier: Hadoop produces text-enhanced citation data
    • Search & Visualization Tier: Apache Solr for navigation, search, and text analytics; MongoDB for patent details
    • Users navigate, search, and visualize, then drill down to patent details
  • 30. Big Data Flavors – Summary
    Choose a Big Data tool and product based on requirements.
    • MapReduce – massive (100 TB to 1 PB) scale ETL; complex analytics on massive data; large-scale unstructured data analysis
    • Massively Parallel Processing (MPP) – batch-oriented aggregations; analytics on moderately large structured data with predictable access patterns
    • In-Memory DB – similar to MPP but where real-time access patterns are required; rich and interactive business intelligence apps
    • NoSQL databases – similar to In-Memory DB but with simpler (non-SQL) access patterns; provide fast access to detail data where other techniques serve summary data
    • GPGPU – real-time Value at Risk (financial risk management); compute-intensive analytics, e.g. simulating a hospital waiting room over one year