1
Take advantage of ALL of your data
Augmenting Big Data Analytics
with Nirvana
Sept 2016 Igor Sfiligoi
2
• Nirvana® is a metadata, data placement
and data management solution optimized for
managing distributed unstructured data
• It supports many modes of operation
– In this talk we explore only how it
fits in a Big Data Analytics context
– All the other capabilities can be used alongside, but will not be discussed
• Nirvana is a commercial software product,
developed by General Atomics
• More information at:
– http://www.ga.com/nirvana
– https://en.Wikipedia.org/wiki/Nirvana_(software)
What is Nirvana?
3
• Big Data Analytics is
– The process of examining
large and diverse data sets to uncover
hidden patterns and previously unknown correlations
– Extensively used both in enterprises
and in science circles
• No single tool can do the whole job
– Custom data extraction needed
to accommodate all the possible data formats
– Efficient filtering and processing frameworks needed
due to the large data volumes
What is Big Data Analytics?
4
• Structured data
– Well defined schema
– e.g. rows in a database
• Unstructured data
– Usually describes something in great detail
– Requires custom code to extract actionable information
– e.g. images -> walls with cracks, or
raw instrument readouts -> phase change coordinates
• Semi-structured data
– No fixed schema, but still easily parsable
– Several variants:
• Subset of schema fixed, others optional
• Tree like structures, where each level is well defined, depth variable
• Self describing structures
– e.g. JSON documents
Types of data
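To make the three categories concrete, here is a minimal Python sketch. The records, the field names and the `count_cracks` extractor are illustrative assumptions, not taken from the talk; the byte string merely stands in for image pixels.

```python
import json

# Structured: fixed schema, every row has the same columns
# (wall_id, inspection_date, crack_count).
structured_rows = [
    ("wall-01", "2016-09-01", 3),
    ("wall-02", "2016-09-02", 0),
]

# Semi-structured: no fixed schema, but self-describing and easy to parse.
semi_structured = json.loads(
    '{"wall_id": "wall-01",'
    ' "cracks": [{"length_cm": 12}, {"length_cm": 4}],'
    ' "inspector": "an optional field"}'
)

# Unstructured: opaque bytes; custom code is needed to get anything out.
unstructured_image = bytes([0, 0, 255, 0, 255, 255, 0, 0])


def count_cracks(pixels, threshold=128):
    """Hypothetical extractor: counts runs of bright pixels as 'cracks'."""
    runs, in_run = 0, False
    for p in pixels:
        bright = p >= threshold
        if bright and not in_run:
            runs += 1
        in_run = bright
    return runs


crack_total = sum(row[2] for row in structured_rows)               # plain column access
longest = max(c["length_cm"] for c in semi_structured["cracks"])   # parse, then navigate
extracted = count_cracks(unstructured_image)                       # custom code required
```

The structured and semi-structured cases need only indexing; the unstructured case needs a purpose-built function before any question can be asked.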
5
Most data comes as unstructured data
Final analysis must be done on structured data
Data bridging
How do we bridge the gap?
6
• Most data comes as unstructured data
• Final analysis must be done on structured data
• How do we bridge the gap?
– The final structured data is refined
from the original unstructured (raw) data
– The structured data is often called metadata
• Two extremes to get from raw data to metadata
– Extract metadata during ingest, drop raw data
– Keep raw data, extract metadata during analysis
Data refinement
7
• Extracting data at ingest time
– Makes analysis very fast
– But very rigid,
can only answer a fixed number of questions
• Sometimes called ETL (Extract, Transform, Load)
• This is where traditional (SQL) databases shine
– Example single node DBs: PostgreSQL, MariaDB, …
– Example large scale DBs: Teradata, Oracle, …
Refinement at ingest
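A minimal sketch of refinement at ingest, using Python's built-in `sqlite3` module as a stand-in for the SQL database. The raw readout format, the `extract_metadata` helper and the table layout are assumptions made for illustration.

```python
import sqlite3


def extract_metadata(raw):
    """Hypothetical extractor for readouts like 'sensor=<id>;values=<v1>,<v2>'."""
    fields = dict(part.split("=", 1) for part in raw.split(";"))
    values = [float(v) for v in fields["values"].split(",")]
    return fields["sensor"], max(values)


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (sensor TEXT, peak REAL)")

raw_stream = [
    "sensor=a;values=1.0,4.5,2.0",
    "sensor=b;values=0.5,0.7",
    "sensor=a;values=3.0,3.5",
]
for raw in raw_stream:
    # Extract at ingest and store only the refined row; the raw readout is dropped.
    db.execute("INSERT INTO readings VALUES (?, ?)", extract_metadata(raw))

# Fast: a question the schema anticipated.
peaks = dict(db.execute("SELECT sensor, MAX(peak) FROM readings GROUP BY sensor"))

# Rigid: the mean of the raw values can no longer be computed,
# because only the peak of each readout was kept.
```

The query is a single indexed-friendly SQL statement, but any question outside the chosen columns is unanswerable once the raw data is gone.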
8
• Refining data at analysis time
– Extremely flexible, can answer any question
– Extremely (computationally) expensive
• Recent Big Data frameworks were developed to
tackle this at scale
– e.g. Hadoop’s MapReduce
Refinement during analysis
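The pattern behind frameworks such as MapReduce can be imitated in a few lines of plain Python. This is an in-process sketch of the map, shuffle and reduce phases, not Hadoop's API; the file names and contents are made up.

```python
from collections import defaultdict
from functools import reduce

# Stand-in for raw files kept on storage (name -> content).
raw_files = {
    "run1.log": "ok ok error ok",
    "run2.log": "error error ok",
}


def map_phase(name, text):
    """Emit (key, 1) pairs; this function is rewritten for each new question."""
    return [(word, 1) for word in text.split()]


def shuffle(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine the values of each key into a single result."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


mapped = [pair for name, text in raw_files.items() for pair in map_phase(name, text)]
counts = reduce_phase(shuffle(mapped))
```

Every query re-reads every raw file, which is what makes the approach flexible and expensive at the same time.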
9
• In practice, everyone wants it both ways
– Fast, and
– Flexible
Two basic approaches:
• Semi-structured data
– Keep much more metadata, with flexible schema
– Make it relatively cheap to further refine
• Tiered systems
– Extract some metadata at ingest time
– Keep the original raw files
– Link the two together
The middle road
Metadata could be
semi-structured
10
• Using the semi-structured approach enables
– Much more flexibility
– Can use some of the optimization techniques
used with truly structured data
• However
– Still cannot answer all the questions
(we lost a large fraction of the original information)
– Still not as fast as truly structured data
(flexibility has its price)
• Popularized by recent NoSQL databases
– e.g. MongoDB
– Most “traditional” (SQL) databases have added these
capabilities over the past few years, too
The semi-structured approach
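A small sketch of semi-structured metadata, assuming JSON documents with hypothetical field names. It shows optional fields, a simple index of the kind used with structured data, and the limit of the approach.

```python
import json

documents = [json.loads(s) for s in (
    '{"file": "img1.raw", "camera": "A", "cracks": 2}',
    '{"file": "img2.raw", "camera": "B"}',
    '{"file": "img3.raw", "camera": "A", "cracks": 0, "gps": {"lat": 32.9}}',
)]

# Optimization borrowed from structured data: an index on a field every document has.
by_camera = {}
for doc in documents:
    by_camera.setdefault(doc["camera"], []).append(doc["file"])

camera_a = by_camera["A"]

# Flexible schema: optional fields are handled per document.
with_cracks = [d["file"] for d in documents if d.get("cracks", 0) > 0]
has_gps = [d["file"] for d in documents if "gps" in d]

# Still limited: a property that was never extracted (say, lighting conditions)
# cannot be queried without going back to the raw images.
```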
11
• A tiered approach uses the best tool for the job
– A database for the metadata (possibly SQL, but not required)
– A Big Data framework for raw data processing
– A metadata-aware data management system
for linking the two
• The best tool is used as appropriate
– Use the database whenever possible
(i.e. if it fits in the domain of existing metadata)
– Else
• Use the data management system to get the subset of
raw data objects to analyze (as much as possible)
• Use the Big Data framework on the subset
to get the desired answers
– If appropriate, feed the new metadata into the database
The tiered approach
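The decision flow above can be sketched as follows. This is a toy model under stated assumptions: a dict stands in for physical storage, an in-memory SQLite table for the metadata database, and a plain loop for the Big Data framework. The function names, paths and the `max_<field>` key convention are hypothetical.

```python
import sqlite3

# Raw tier: stand-in for files on physical storage (path -> content).
raw_store = {
    "/data/2016/a.txt": "temp=20 temp=25 pressure=3",
    "/data/2016/b.txt": "temp=30",
    "/data/2015/c.txt": "temp=10 pressure=7",
}

# Pre-digested tier: metadata extracted at ingest, linked to files by path.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE meta (path TEXT, key TEXT, value REAL)")
db.executemany(
    "INSERT INTO meta VALUES (?, 'year', ?)",
    [(p, int(p.split("/")[2])) for p in raw_store],
)


def files_for_year(year):
    """Use existing metadata to narrow down the raw files worth touching."""
    rows = db.execute("SELECT path FROM meta WHERE key = 'year' AND value = ?", (year,))
    return sorted(r[0] for r in rows)


def max_reading(field, year):
    """Answer 'max <field> in <year>' with the cheapest tier that can."""
    key = "max_" + field
    result, tier = None, "database"
    for path in files_for_year(year):
        row = db.execute(
            "SELECT value FROM meta WHERE path = ? AND key = ?", (path, key)
        ).fetchone()
        if row is None:
            # Power tier: process the raw file, then feed the result back.
            tier = "raw"
            values = [float(tok.split("=")[1])
                      for tok in raw_store[path].split()
                      if tok.startswith(field + "=")]
            value = max(values) if values else None
            db.execute("INSERT INTO meta VALUES (?, ?, ?)", (path, key, value))
        else:
            value = row[0]
        if value is not None and (result is None or value > result):
            result = value
    return result, tier


first = max_reading("temp", 2016)   # must read the raw files
second = max_reading("temp", 2016)  # answered from saved metadata
```

The first call processes only the 2016 files and saves what it learned; the second call never touches the raw tier.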
12
Tiered Analytics in a picture
[Flow diagram]
• Query: can it be solved with existing metadata?
– yes: query the database -> Answer (Pre-digested Tier: fast but limited)
– no: mine the available metadata in the database to find the relevant raw files, then process those raw files from physical storage with the Big Data framework (multiple "Process raw files" tasks) -> Answer (Power Tier: flexible but slower)
• (Optionally) save the extracted metadata in the database, so the next queries run faster
13
Winning strategy – stage one
A tiered approach to Big Data Analytics provides the best competitive advantage
Composed of three layers
• Database
• Big Data Framework
• Metadata-aware data management system
14
• Nirvana is the metadata-aware data management system
– Provides the means for linking metadata
with raw data objects
• Three fundamental roles
– Provides standardized schema
– Manages registration of files in the database
(plus updates, renames and deletions, autonomously)
– Bridges database and storage security domains
(user identity and permission)
• Additionally, automated extraction of metadata from files
– Triggered on creation and update
– Extraction rules defined by system administrators
– But users can add additional metadata anytime
(if authorized)
Nirvana’s role
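The registration and extraction behavior described on this slide can be illustrated with a toy catalog. This is not Nirvana's API; the class, method names and the example rule are assumptions that only mirror the listed roles (rules defined by administrators, extraction triggered on creation and update, user-added metadata, renames and deletions tracked).

```python
class Catalog:
    """Toy metadata catalog; a sketch of the idea, not a real product interface."""

    def __init__(self):
        self.rules = []      # (file suffix, extractor) pairs, defined by administrators
        self.metadata = {}   # file path -> {key: value}

    def add_rule(self, suffix, extractor):
        self.rules.append((suffix, extractor))

    def on_write(self, path, content):
        """Called on file creation and on every update."""
        record = self.metadata.setdefault(path, {})
        for suffix, extractor in self.rules:
            if path.endswith(suffix):
                record.update(extractor(content))

    def tag(self, path, key, value):
        """User-supplied metadata, added at any time (authorization omitted here)."""
        self.metadata[path][key] = value

    def on_rename(self, old, new):
        """Keep the metadata attached to the file when it is renamed."""
        self.metadata[new] = self.metadata.pop(old)

    def on_delete(self, path):
        self.metadata.pop(path, None)


catalog = Catalog()
catalog.add_rule(".csv", lambda text: {"rows": len(text.splitlines())})

catalog.on_write("/lab/run1.csv", "a,b\n1,2\n")          # creation triggers extraction
catalog.tag("/lab/run1.csv", "owner", "alice")           # user adds metadata
catalog.on_write("/lab/run1.csv", "a,b\n1,2\n3,4\n")     # update re-triggers extraction
catalog.on_rename("/lab/run1.csv", "/lab/final.csv")     # metadata follows the file
```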
15
• But what about Big Data SQL databases?
– e.g. Hive, Presto
• Tools like Hive are just a cost saving solution
– They do not provide capabilities not present in
high-end “traditional” SQL databases, like Teradata
– But they do provide a better value per TByte stored
(at a moderate cost in query performance)
• They should be used as an additional tier
– Hot metadata in a “traditional” database
– Rarely used metadata in a “low cost” database
– Possibly with transparent gluing between them
(e.g. Teradata QueryGrid)
Wait a minute…
16
The slides so far assumed
a homogeneous environment
• Not a very realistic scenario
these days
A typical enterprise will have
several storage and
compute technologies deployed
• Organized into a Data Lake
Big Data Analytics in a Data Lake
17
• A single logical repository for
all data handled by an enterprise
– As opposed to having
different data in different data silos
• Logically integrated
storage and compute infrastructure
– Since data analytics requires both
• See also
http://www.slideshare.net/igor_sfiligoi/creating-a-real-data-lake-with-nirvana
What is a Data Lake?
18
• All the infrastructure is logically related, but
– Different technical solutions
are optimized for different factors
• e.g. speed vs reliability vs cost
– Not every compute platform will work
with every storage solution
• During Big Data Analysis, data must often
be migrated between repositories
– Often just to maximize efficiency
– Sometimes there simply is no other option
Data Lake Analytics challenges
19
• Moving data around manually is not an option
• A flexible data management system is essential
– Global namespace
– Transparent, fully automated
data movement and replication
– Able to interface with
solutions from multiple vendors
• And it also must be metadata-aware
– Tiered Big Data analytics needs metadata-file pairing
– These pairs must be preserved across file moves/replicas
Truly integrated infrastructure
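A sketch of why a global namespace keeps metadata-file pairs intact: metadata is keyed by the logical path, while physical replicas come and go underneath. The class, the storage names and the URLs are hypothetical and do not describe any particular product.

```python
class Namespace:
    """Toy global namespace: logical path -> physical replicas plus metadata."""

    def __init__(self):
        self.replicas = {}   # logical path -> set of (storage, physical location)
        self.metadata = {}   # logical path -> {key: value}

    def register(self, logical, storage, physical, **meta):
        self.replicas.setdefault(logical, set()).add((storage, physical))
        self.metadata.setdefault(logical, {}).update(meta)

    def replicate(self, logical, storage, physical):
        """Add a copy on another storage system; the metadata is untouched."""
        self.replicas[logical].add((storage, physical))

    def migrate(self, logical, src_storage, dst_storage, physical):
        """Move the copy held on src_storage to dst_storage."""
        kept = {r for r in self.replicas[logical] if r[0] != src_storage}
        kept.add((dst_storage, physical))
        self.replicas[logical] = kept

    def locate(self, logical, compatible):
        """Return a replica on a storage system the compute platform can read."""
        for storage, physical in sorted(self.replicas[logical]):
            if storage in compatible:
                return storage, physical
        return None


ns = Namespace()
ns.register("/lake/scan42", "archive", "tape://vol7/0042", instrument="sem")

before = ns.locate("/lake/scan42", {"hdfs"})              # no compatible copy yet
ns.replicate("/lake/scan42", "hdfs", "hdfs://nn/scan42")  # data movement step
after = ns.locate("/lake/scan42", {"hdfs"})
ns.migrate("/lake/scan42", "archive", "cloud", "s3://bucket/scan42")
```

Because analytics code only ever sees `/lake/scan42`, replication and migration change where the bytes live without breaking the link to the metadata.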
20
Real Big Data Analytics in a picture
[Flow diagram]
• Query: can it be solved with existing metadata?
– yes: query the database -> Answer (Pre-digested Tier: fast but limited)
– no: mine the available metadata in the database to find the relevant raw files, then run the appropriate Big Data framework on them from compatible storage -> Answer (Data Lake Tier: flexible but slower)
• Data Management Layer: logical to physical file mapping across archival, cloud and interactive storage; locates files and handles data movement (if needed)
• (Optionally) save the extracted metadata in the database, so the next queries run faster
21
Winning strategy – stage two
Big Data Analytics
over a
truly integrated Data Lake
provides the
best competitive advantage
Composed of three layers
• Database
• Data Lake
• Flexible, metadata-aware
data management system
22
• Nirvana is the flexible,
metadata-aware data management system
– Metadata capabilities described in previous slides
• Supports multiple storage technologies,
from multiple vendors
– Creates a logical, global namespace
• Fully integrated data movement
and replication capabilities
– Can be API driven
– Plus a fully automated policy engine
Nirvana’s role
See also: http://www.slideshare.net/igor_sfiligoi/building-a-global-namespace-with-nirvana

Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 

Recently uploaded (20)

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

Augmenting Big Data Analytics with Nirvana

  • 1. 1 Take advantage of ALL of your data Augmenting Big Data Analytics with Nirvana Sept 2016 Igor Sfiligoi
  • 2. 2 • Nirvana® is a metadata, data placement and data management solution optimized for managing distributed unstructured data • It supports many modes of operation – In this talk we explore only how it fits in a Big Data Analytics context – All the other capabilities can be used alongside, but will not be discussed • Nirvana is a commercial software product, developed by General Atomics • More information at: – http://www.ga.com/nirvana – https://en.Wikipedia.org/wiki/Nirvana_(software) What is Nirvana?
  • 3. 3 • Big Data Analytics is – The process of examining large and diverse data sets to uncover hidden patterns and previously unknown correlations – Extensively used both in the enterprises and in science circles • No single tool can do the whole job – Custom data extraction needed to accommodate all the possible data formats – Efficient filtering and processing frameworks needed due to the large data volumes What is Big Data Analytics?
  • 4. 4 • Structured data – Well-defined schema – e.g. rows in a database • Unstructured data – Usually describes something in great detail – Requires custom code to extract actionable information – e.g. images -> walls with cracks, or raw instrument readouts -> phase change coordinates • Semi-structured data – No fixed schema, but still easily parsable – Several variants: • Subset of schema fixed, others optional • Tree-like structures, where each level is well defined, depth variable • Self-describing structures – e.g. JSON documents Types of data
  • 5. 5 Most data comes as unstructured data Final analysis must be done on structured data Data bridging How do we bridge the gap?
  • 6. 6 • Most data comes as unstructured data • Final analysis must be done on structured data • How do we bridge the gap? – The final structured data is refined from the original unstructured (raw) data – The structured data is often called metadata • Two extremes to get from raw data to metadata – Extract metadata during ingest, drop raw data – Keep raw data, extract metadata during analysis Data refinement
  • 7. 7 • Extracting data at ingest time – Makes analysis very fast – But very rigid, can only answer a fixed number of questions • Sometimes called ETL (Extract, Transform, Load) • This is where traditional (SQL) databases shine – Example single node DBs: PostgreSQL, MariaDB, … – Example large scale DBs: Teradata, Oracle, … Refinement at ingest
  • 8. 8 • Refining data at analysis time – Extremely flexible, can answer any question – Extremely (computationally) expensive • Recent Big Data frameworks were developed to tackle this at scale – e.g. Hadoop’s MapReduce Refinement during analysis
  • 9. 9 • In practice, everyone wants it both ways – Fast, and – Flexible Two basic approaches: • Semi-structured data – Keep much more metadata, with flexible schema – Make it relatively cheap to further refine • Tiered systems – Extract some metadata at ingest time – Keep the original raw files – Link the two together The middle road Metadata could be semi-structured
  • 10. 10 • Using the semi-structured approach enables – Much more flexibility – Can use some of the optimization techniques used with truly structured data • However – Still cannot answer all the questions (we lost a large fraction of the original information) – Still not as fast as truly structured data (flexibility has its price) • Popularized by recent NoSQL databases – e.g. MongoDB – Most “traditional” (SQL) databases have added these capabilities over the past few years, too The semi-structured approach
  • 11. 11 • A tiered approach uses the best tool for the job – A database for the metadata (possibly SQL, but not required) – A Big Data framework for raw data processing – A metadata-aware data management system for linking the two • The best tool is used as appropriate – Use the database whenever possible (i.e. if it fits in the domain of existing metadata) – Else • Use the data management system to get the subset of raw data objects to analyze (as much as possible) • Use the Big Data framework on the subset to get the desired answers – If appropriate, feed the new metadata into the database The tiered approach
  • 12. 12 Tiered Analytics in a picture [Diagram: a query first goes to the database (pre-digested tier, fast but limited). If it can be solved with existing metadata, the database returns the answer. If not, the available metadata is mined to identify the relevant raw files on physical storage, and the Big Data framework processes those files (power tier, flexible but slower). Extracted metadata is optionally saved, so next queries run faster.]
  • 13. 13 Winning strategy – stage one Composed of three layers • Database • Big Data Framework • Metadata-aware data management system A tiered approach to Big Data Analytics provides the best competitive advantage
  • 14. 14 • Nirvana is the metadata-aware data management system – Provides the means for linking metadata with raw data objects • Three fundamental roles – Provides standardized schema – Manages registration of files in the database (plus updates, renames and deletions, autonomously) – Bridges database and storage security domains (user identity and permission) • Additionally, automated extraction of metadata from files – Triggered on creation and update – Extraction rules defined by system administrators – But users can add additional metadata anytime (if authorized) Nirvana’s role
  • 15. 15 • But what about Big Data SQL databases? – e.g. Hive, Presto • Tools like Hive are just a cost-saving solution – They do not provide capabilities not present in high-end “traditional” SQL databases, like Teradata – But they do provide a better value per TByte stored (at a moderate cost in query performance) • They should be used as an additional tier – Hot metadata in a “traditional” database – Rarely used metadata in a “low cost” database – Possibly with transparent gluing between them (e.g. Teradata QueryGrid) Wait a minute…
  • 16. 16 The slides so far were assuming a homogeneous environment • Not a very realistic scenario these days A typical enterprise will have several storage and compute technologies deployed • Organized into a Data Lake Big Data Analytics in a Data Lake
  • 17. 17 • A single logical repository for all data handled by an enterprise – As opposed to having different data in different data silos • Logically integrated storage and compute infrastructure – Since data analytics requires both • See also http://www.slideshare.net/igor_sfiligoi/creating-a-real-data-lake-with-nirvana What is a Data Lake?
  • 18. 18 • All the infrastructure is logically related, but – Different technical solutions are optimized for different factors • e.g. speed vs reliability vs cost – Not every compute platform will work with every storage solution • During Big Data Analysis, data must often be migrated between repositories – Often just to maximize efficiency – Sometimes there simply is no other option Data Lake Analytics challenges
  • 19. 19 • Moving data around manually is not an option • A flexible data management system is essential – Global namespace – Transparent, fully automated data movement and replication – Able to interface with solutions from multiple vendors • And it also must be metadata-aware – Tiered Big Data analytics needs metadata-file pairing – These pairs must be preserved across file moves/replicas Truly integrated infrastructure
  • 20. 20 Real Big Data Analytics in a picture [Diagram: a query first goes to the database (pre-digested tier, fast but limited). If it can be solved with existing metadata, the database returns the answer. If not, the available metadata is mined to identify the relevant raw files; the data management layer maps logical to physical files, locates them across archival, cloud and interactive storage, and handles data movement to compatible storage if needed; the appropriate Big Data framework then processes them (Data Lake tier, flexible but slower). Extracted metadata is optionally saved, so next queries run faster.]
  • 21. 21 Winning strategy – stage two Big Data Analytics over a truly integrated Data Lake provides the best competitive advantage Composed of three layers • Database • Data Lake • Flexible, metadata-aware data management system
  • 22. 22 • Nirvana is the flexible, metadata-aware data management system – Metadata capabilities described in previous slides • Supports multiple storage technologies, from multiple vendors – Creates a logical, global namespace • Fully integrated data movement and replication capabilities – Can be API driven – Plus, a fully automated policy engine, too Nirvana’s role See also: http://www.slideshare.net/igor_sfiligoi/building-a-global-namespace-with-nirvana
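
The tiered query flow from slides 11, 12 and 20 can be illustrated with a small, self-contained sketch. It is not Nirvana's API, and nothing in it comes from the slides beyond the flow itself: an in-memory SQLite table stands in for the metadata database, the `RAW_FILES` dictionary stands in for the raw data objects, and `count_cracks` stands in for a Big Data framework job. All names, paths and the table layout are hypothetical.

```python
import sqlite3

# Hypothetical raw objects, keyed by logical name (stand-in for raw storage).
RAW_FILES = {
    "/lake/wall_a.img": "crack crack smooth crack",
    "/lake/wall_b.img": "smooth smooth",
    "/lake/notes.txt": "crack",  # not an image, so it is never processed
}


def open_metadata_db():
    """Create the pre-digested tier: one row per (file, metadata key)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE meta (path TEXT, key TEXT, value INTEGER, "
        "PRIMARY KEY (path, key))"
    )
    # Assumption: only the file kind is extracted at ingest time.
    for path in RAW_FILES:
        kind = 1 if path.endswith(".img") else 0
        db.execute("INSERT INTO meta VALUES (?, 'is_image', ?)", (path, kind))
    return db


def count_cracks(raw):
    """Stand-in for an expensive raw-data processing job."""
    return raw.split().count("crack")


def walls_with_cracks(db, minimum):
    """Return (paths, tier) for images with at least `minimum` cracks."""
    # Mine the available metadata to select the relevant subset of files.
    images = [row[0] for row in db.execute(
        "SELECT path FROM meta WHERE key = 'is_image' AND value = 1")]
    known = dict(db.execute(
        "SELECT path, value FROM meta WHERE key = 'cracks'"))
    missing = [path for path in images if path not in known]
    # The database alone suffices only if every relevant file has the value.
    tier = "database" if not missing else "raw"
    for path in missing:
        value = count_cracks(RAW_FILES[path])
        # Save the extracted metadata, so next queries run faster.
        db.execute("INSERT INTO meta VALUES (?, 'cracks', ?)", (path, value))
        known[path] = value
    return sorted(p for p in images if known[p] >= minimum), tier
```

Calling `walls_with_cracks(db, 2)` twice on the same database exercises both tiers: the first call has to process the raw images and stores the crack counts, while the second is answered from the stored metadata alone.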