Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path for Compute Frameworks
© Cloudera, Inc. All rights reserved.
Introducing RecordService
Lenni Kuff
RecordService is a distributed, scalable data access service for unified authorization in Hadoop.
Motivation
• As the Hadoop ecosystem expands, new components continue to be added
• Speaks to the overall flexibility of Hadoop
• This is good - more functionality, more workloads, more use cases.
• As use cases for Hadoop mature, user requirements and expectations increase:
• Security
• Performance
• Compatibility
• The flexibility of Hadoop has come at the cost of increased complexity
[Figure: diagram of the Hadoop stack with Storage and Compute layers; further detail not recoverable]
Example: Security
Challenge: Provide unified fine-grained security across compute frameworks
• Integrating a consistent security layer into every component is not scalable.
• Securing data at the file level precludes fine-grained access control (column/row)
• File ACLs are not enough: a user can view all of a file or none of it.
• Currently, files must be split and data duplicated, at a large operational cost.
Solution: Add a level of abstraction, a secure service to access datasets in “record” format
• Can now apply fine-grained constraints on a projection of the dataset
• Same access control policy can be applied uniformly across compute frameworks; decoupled from the underlying storage layer
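A minimal sketch of the idea on this slide, not RecordService code: once data is served as records rather than files, a policy can be written as a column projection plus a row predicate and applied the same way for every framework. The `Policy` class and the `read_records` helper are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


@dataclass(frozen=True)
class Policy:
    """Hypothetical per-user policy: visible columns plus a row filter."""
    columns: List[str]
    row_filter: Callable[[Record], bool]


def read_records(rows: Iterable[Record], policy: Policy) -> Iterator[Record]:
    """Yield only the permitted rows, restricted to the permitted columns."""
    for row in rows:
        if policy.row_filter(row):
            yield {col: row[col] for col in policy.columns}


rows = [
    {"name": "Ann", "balance": 10, "region": "Europe", "ccn": "4111"},
    {"name": "Bob", "balance": 20, "region": "Asia", "ccn": "4222"},
]

# An analyst who may see names and regions for European rows only.
analyst = Policy(columns=["name", "region"],
                 row_filter=lambda r: r["region"] == "Europe")
visible = list(read_records(rows, analyst))
```

With file-level ACLs the same restriction would need a separate, filtered copy of the file for each policy.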
Introducing RecordService
RecordService - Overview
• Simplifies
• Provides a higher-level, logical abstraction for data (i.e., tables or views)
• Returns records with a schema (instead of paths and bytes). No need for applications to worry about storage APIs and file formats.
• HCatalog? Similar concept - RecordService is secure and performant. Plan to support HCatalog as a data model on RecordService.
• Secures
• Central location for all authorization checks using Sentry metadata.
• Secure service that does not execute arbitrary user code
• Accelerates
• Unified data access path allows platform-wide performance improvements.
Architecture
• Runs as a distributed service: Planner Servers & Worker Servers
• Servers do not store any state
• Easy HA, fault tolerance.
• Planner Servers responsible for request planning
• Retrieve and combine metadata (NN, HMS, Sentry)
• Split generation -> Creates tasks for workers
• Performs authorization
• Worker Servers read from storage and construct records.
• IO, file parsing, predicate evaluation
• Runs as the “source” for a DAG computation
Architecture – Server APIs
• Planner and Worker services expose thrift APIs
• PlanRequest(), Exec(), Fetch()
• PlanRequest()
• Accepts SQL to specify the request: supports SELECT and PROJECT (row selection and column projection)
• Access to tables and views stored in HMS
• Does not run operators that require data exchange; “map only”
• Generates a list of tasks that contain the request, each with locality information
• Exec()/Fetch()
• Returns records in a canonical, optimized columnar format.
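The request flow described above can be modelled in a few lines. This is a toy, in-process sketch, not the thrift API: `Task`, `plan_request`, `exec_and_fetch` and the `SPLITS` table are hypothetical stand-ins, and the SQL string is carried along without being parsed.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class Task:
    """Hypothetical task: the request plus one split and its preferred hosts."""
    request: str
    split_id: int
    local_hosts: Tuple[str, ...]


# Toy "storage": split id -> (hosts holding the split, rows in the split).
SPLITS: Dict[int, Tuple[Tuple[str, ...], List[dict]]] = {
    0: (("node1",), [{"name": "Ann"}, {"name": "Bob"}]),
    1: (("node2",), [{"name": "Cy"}]),
}


def plan_request(request: str) -> List[Task]:
    """Planner side: one task per split, each tagged with locality."""
    return [Task(request, sid, hosts)
            for sid, (hosts, _rows) in sorted(SPLITS.items())]


def exec_and_fetch(task: Task, batch_size: int = 1) -> Iterator[List[dict]]:
    """Worker side: read the task's split and hand back records in batches."""
    rows = SPLITS[task.split_id][1]
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


tasks = plan_request("SELECT name FROM t")
names = [row["name"]
         for task in tasks
         for batch in exec_and_fetch(task)
         for row in batch]
```

A compute framework would schedule each task on one of its `local_hosts` and run the fetch loop there, which is why the planner never needs to move data between workers.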
Architecture – Fault tolerance
• Cluster state persisted in ZK
• Membership, delegation tokens, secret keys
• Servers do not communicate with each other directly => scalability
• Planner services
• Expected to run a few (e.g. 3) for HA
• Fault tolerance handled with clients getting a list of planners and failing over
• Plan requests are short
• Worker services
• Expect to run on each node in the cluster with data
• Fault tolerance handled by framework (e.g. MR) rescheduling task
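Client-side planner failover, as described above, might look like the following sketch. The `plan_with_failover` function and the fake transport are assumptions made for illustration; the real client library's retry behaviour is not described in the slides.

```python
from typing import Callable, Iterable, List


def plan_with_failover(planners: Iterable[str],
                       send_plan: Callable[[str, str], List[str]],
                       request: str) -> List[str]:
    """Try each planner in turn and return the first successful plan."""
    last_error = None
    for host in planners:
        try:
            return send_plan(host, request)
        except ConnectionError as err:
            last_error = err  # planner unreachable, try the next one
    raise ConnectionError("no planner reachable") from last_error


def fake_send_plan(host: str, request: str) -> List[str]:
    """Stand-in transport: planner1 is down, the others answer."""
    if host == "planner1":
        raise ConnectionError(host)
    return [f"{host}:task0", f"{host}:task1"]


tasks = plan_with_failover(["planner1", "planner2", "planner3"],
                           fake_send_plan, "SELECT name FROM t")
```

Because plan requests are short and planners hold no state, retrying the whole request against another planner is cheap.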
Architecture – Security
• Authentication using Kerberos and delegation tokens
• Planner authorizes request using metadata in Sentry
• Column level ACLs
• Row level ACLs – create a view with a predicate
• Masking – create a view with the masking function in the select list
• Tasks generated by the planner are signed with a shared key
• Worker runs generated tasks.
• Does not authorize, relies on signed tasks
• Runs as user with full access to data, does not run user code
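The signed-task handoff can be illustrated with a message authentication code. The slides only say that tasks are signed with a shared key; HMAC-SHA256 and the function names below are assumptions chosen for this sketch, not a description of the actual scheme.

```python
import hashlib
import hmac


def sign_task(shared_key: bytes, task: bytes) -> bytes:
    """Planner side: sign the serialized task with the shared key."""
    return hmac.new(shared_key, task, hashlib.sha256).digest()


def verify_task(shared_key: bytes, task: bytes, signature: bytes) -> bool:
    """Worker side: accept the task only if the signature matches."""
    expected = sign_task(shared_key, task)
    return hmac.compare_digest(expected, signature)


key = b"cluster-shared-secret"  # illustrative value only
task = b"SELECT ccn, name FROM v; split=0"
signature = sign_task(key, task)

accepted = verify_task(key, task, signature)
tampered = verify_task(key, b"SELECT * FROM data; split=0", signature)
```

Since the worker trusts only tasks that carry a valid signature, a client cannot widen its access by editing a task after the planner has authorized it.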
Architecture – Security example
CREATE VIEW v AS
SELECT mask(credit_card_number) AS ccn,
name, balance, region
FROM data WHERE region = 'Europe'
1. Restrict access to the data set: disable access to the ‘data’ table and underlying files in HDFS.
2. Give access by creating view, v
3. Set column level permissions on v per user if necessary
Write path (ingest) unchanged. Job expected to run as privileged user.
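To see what such a view returns, the example can be reproduced locally with SQLite standing in for the Hadoop stack. This is only a sketch: SQLite has no access control, so step 1 is not shown, and the `mask` implementation (keep the last four digits) is an assumption, since the slides do not define it.

```python
import sqlite3


def mask(ccn: str) -> str:
    """Hypothetical masking rule: hide all but the last four digits."""
    return "X" * (len(ccn) - 4) + ccn[-4:]


conn = sqlite3.connect(":memory:")
# Allow the user-defined function inside a view on builds that restrict it.
conn.execute("PRAGMA trusted_schema = ON")
conn.create_function("mask", 1, mask)

conn.execute(
    "CREATE TABLE data (credit_card_number TEXT, name TEXT, "
    "balance REAL, region TEXT)"
)
conn.executemany(
    "INSERT INTO data VALUES (?, ?, ?, ?)",
    [
        ("4111111111111111", "Ann", 10.0, "Europe"),
        ("4222222222222222", "Bob", 20.0, "Asia"),
    ],
)

# Same shape as the view on the slide: masked column plus a row predicate.
conn.execute(
    """
    CREATE VIEW v AS
    SELECT mask(credit_card_number) AS ccn, name, balance, region
    FROM data WHERE region = 'Europe'
    """
)

rows = conn.execute("SELECT ccn, name, region FROM v").fetchall()
conn.close()
```

Readers of `v` see only European rows, and never the full card number.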
Client APIs – Integration with ecosystem
• Similar APIs designed to integrate with MapReduce and Spark
• Client APIs make things simpler
• No need to interact with HMS
• No need to care about the underlying storage format: the worker always returns records in a canonical format.
• No need to handle storage engine details (e.g. S3)
Client Integration APIs
• Drop in replacements for common existing InputFormats
• Text, Avro
• Can be used with Spark as well
• SparkSQL: integration with the Data Sources API
• Predicate pushdown, projection
• Migration should be easy
MR Example
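// Before: read Avro files directly from a path.
//FileInputFormat.setInputPaths(job, new Path(args[0]));
//job.setInputFormatClass(AvroKeyInputFormat.class);
// After: read the table named by args[0] through RecordService,
// using the drop-in AvroKeyInputFormat from the RecordService package.
RecordServiceConfig.setInputTable(configuration, null, args[0]);
job.setInputFormatClass(
    com.cloudera.recordservice.avro.mapreduce.AvroKeyInputFormat.class);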
Spark Example
// Comment out one or the other
val file = sc.recordServiceTextFile(path)
//val file = sc.textFile(path)
Spark SQL Example
ctx.sql(s"""
|CREATE TEMPORARY TABLE $tbl
|USING com.cloudera.recordservice.spark.DefaultSource
|OPTIONS (
| RecordServiceTable '$db.$tbl',
| RecordServiceTableSize '$size'
|)
""".stripMargin)
Performance
• Shares some core components with Impala
• IO management, optimized C++ code, runtime code generation, uses low-level storage APIs
• Highly efficient implementation of the scan functionality
• Optimized columnar on-wire format
• Inspired by Apache Parquet
• Accelerates performance for many workloads
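As a rough illustration of what "columnar" means for a batch of records, the sketch below pivots row-oriented records into one list per column. It shows only the general layout idea; the actual RecordService wire format is not described in the slides, and `to_columnar` is a hypothetical helper.

```python
from typing import Dict, List, Sequence


def to_columnar(rows: Sequence[dict],
                columns: Sequence[str]) -> Dict[str, List[object]]:
    """Pivot row-oriented records into one contiguous list per column."""
    return {col: [row[col] for row in rows] for col in columns}


batch = to_columnar(
    [{"name": "Ann", "balance": 10}, {"name": "Bob", "balance": 20}],
    ["name", "balance"],
)
```

Keeping each column's values together lets a reader that needs only `balance` skip the other columns entirely.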
Terasort
• Close to a worst-case scenario: minimal schema with a single STRING column
• Custom RecordServiceTeraInputFormat (similar to TeraInputFormat)
• 78 Node cluster (12 cores/24 Hyper-Threaded, 12 disks)
• Ran on 1 billion, 50 billion and 1 trillion (~100TB) scales
• See the GitHub repo for more details and runnable examples.
TeraChecksum
[Figure: TeraChecksum bar chart of normalized job time, without and with RecordService, at the 1B, 50B and 1T scales on MapReduce and on Spark. Data labels in extraction order: 1, 0.48, 0.23, 1.03, 0.8, 0.85.]
Spark SQL
• Represents a more typical use case
• Data has a full schema
• TPC-DS
• 500GB scale factor, on Parquet
• Cluster
• 5 node cluster
Spark SQL
[Figure: bar chart of TPC-DS results comparing SparkSQL with SparkSQL with RecordService; vertical axis ticks run from 0 to 350.]
~15% improvement in query times; queries are not scan bound
Spark SQL
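[Figure: bar chart for two queries, a 2% selective scan and Sum(col), comparing SparkSQL with SparkSQL with RecordService. Data labels in extraction order: 29.5, 31, 14, 23.5; vertical axis ticks run from 0 to 35.]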
State of the project
• Available in v0.2 beta:
• Integration with Spark, MR, Pig (via HCatalog)
• Planner HA
• Apache 2.0 Licensed
• Sentry Column-Level Privilege Support
• Mini Roadmap:
• Improved multi-tenancy
• Complex types
• More InputFormat support / integration options
• Intend to donate to Apache Software Foundation
Conclusion
• RecordService provides a schema-aware data access service for Hadoop
• Logical data access instead of physical
• Much more powerful abstraction
• Demonstrated security enforcement, improved performance
• Simpler: clients don’t need to worry about low-level details such as storage APIs and file formats
• Opens the door for future improvements
Contributing!
• Mailing list: recordservice-user@googlegroups.com
• Discussion forum: http://community.cloudera.com/t5/Beta-Releases/bd-p/Beta
• Contributions: http://github.com/cloudera/RecordServiceClient/
• Documentation: http://cloudera.github.io/RecordServiceClient/
• Bug Reporting: https://issues.cloudera.org/projects/RS
• Beta Download: http://www.cloudera.com/downloads/beta/record-service/0-2-0.html
Thank you

More Related Content

What's hot

Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
Cloudera, Inc.
 
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Cloudera, Inc.
 
A deep dive into running data analytic workloads in the cloud
A deep dive into running data analytic workloads in the cloudA deep dive into running data analytic workloads in the cloud
A deep dive into running data analytic workloads in the cloud
Cloudera, Inc.
 
Risk Management for Data: Secured and Governed
Risk Management for Data: Secured and GovernedRisk Management for Data: Secured and Governed
Risk Management for Data: Secured and Governed
Cloudera, Inc.
 
Road to Cloudera certification
Road to Cloudera certificationRoad to Cloudera certification
Road to Cloudera certification
Cloudera, Inc.
 
Data Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the EnterpriseData Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the Enterprise
Cloudera, Inc.
 
Solr consistency and recovery internals
Solr consistency and recovery internalsSolr consistency and recovery internals
Solr consistency and recovery internals
Cloudera, Inc.
 
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the CloudPart 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
Cloudera, Inc.
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Jeremy Beard
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Cloudera, Inc.
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
Cloudera, Inc.
 
Using Hadoop to Drive Down Fraud for Telcos
Using Hadoop to Drive Down Fraud for TelcosUsing Hadoop to Drive Down Fraud for Telcos
Using Hadoop to Drive Down Fraud for Telcos
Cloudera, Inc.
 
Part 3: Models in Production: A Look From Beginning to End
Part 3: Models in Production: A Look From Beginning to EndPart 3: Models in Production: A Look From Beginning to End
Part 3: Models in Production: A Look From Beginning to End
Cloudera, Inc.
 
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSecuring Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Spark Summit
 
Enabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache KuduEnabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache Kudu
Grant Henke
 
Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key ...
Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key ...Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key ...
Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key ...
Cloudera, Inc.
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
Andriy Zabavskyy
 
Intel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data SuccessIntel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data Success
Cloudera, Inc.
 
Apache Spark Operations
Apache Spark OperationsApache Spark Operations
Apache Spark Operations
Cloudera, Inc.
 
Kudu Deep-Dive
Kudu Deep-DiveKudu Deep-Dive
Kudu Deep-Dive
Supriya Sahay
 

What's hot (20)

Data Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache HadoopData Science at Scale Using Apache Spark and Apache Hadoop
Data Science at Scale Using Apache Spark and Apache Hadoop
 
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
 
A deep dive into running data analytic workloads in the cloud
A deep dive into running data analytic workloads in the cloudA deep dive into running data analytic workloads in the cloud
A deep dive into running data analytic workloads in the cloud
 
Risk Management for Data: Secured and Governed
Risk Management for Data: Secured and GovernedRisk Management for Data: Secured and Governed
Risk Management for Data: Secured and Governed
 
Road to Cloudera certification
Road to Cloudera certificationRoad to Cloudera certification
Road to Cloudera certification
 
Data Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the EnterpriseData Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the Enterprise
 
Solr consistency and recovery internals
Solr consistency and recovery internalsSolr consistency and recovery internals
Solr consistency and recovery internals
 
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the CloudPart 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
 
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
Hive, Impala, and Spark, Oh My: SQL-on-Hadoop in Cloudera 5.5
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Using Hadoop to Drive Down Fraud for Telcos
Using Hadoop to Drive Down Fraud for TelcosUsing Hadoop to Drive Down Fraud for Telcos
Using Hadoop to Drive Down Fraud for Telcos
 
Part 3: Models in Production: A Look From Beginning to End
Part 3: Models in Production: A Look From Beginning to EndPart 3: Models in Production: A Look From Beginning to End
Part 3: Models in Production: A Look From Beginning to End
 
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo VanzinSecuring Spark Applications by Kostas Sakellis and Marcelo Vanzin
Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin
 
Enabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache KuduEnabling the Active Data Warehouse with Apache Kudu
Enabling the Active Data Warehouse with Apache Kudu
 
Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key ...
Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key ...Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key ...
Hadoop Distributed File System (HDFS) Encryption with Cloudera Navigator Key ...
 
A Closer Look at Apache Kudu
A Closer Look at Apache KuduA Closer Look at Apache Kudu
A Closer Look at Apache Kudu
 
Intel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data SuccessIntel and Cloudera: Accelerating Enterprise Big Data Success
Intel and Cloudera: Accelerating Enterprise Big Data Success
 
Apache Spark Operations
Apache Spark OperationsApache Spark Operations
Apache Spark Operations
 
Kudu Deep-Dive
Kudu Deep-DiveKudu Deep-Dive
Kudu Deep-Dive
 

Viewers also liked

Securing Your Apache Spark Applications
Securing Your Apache Spark ApplicationsSecuring Your Apache Spark Applications
Securing Your Apache Spark Applications
Cloudera, Inc.
 
PCRF-Policy Charging System-Functional Analysis
PCRF-Policy Charging System-Functional AnalysisPCRF-Policy Charging System-Functional Analysis
PCRF-Policy Charging System-Functional Analysis
Biju M R
 
Switchyard design overview
Switchyard design overviewSwitchyard design overview
Switchyard design overview
Milind Punj
 
Benefits And Applications of PET Plastic Packaging
Benefits And Applications of PET Plastic PackagingBenefits And Applications of PET Plastic Packaging
Benefits And Applications of PET Plastic Packaging
plasticingenuity
 
1. GRID COMPUTING
1. GRID COMPUTING1. GRID COMPUTING
1. GRID COMPUTING
Dr Sandeep Kumar Poonia
 
Cross cultural communication in business world
Cross cultural communication in business worldCross cultural communication in business world
Cross cultural communication in business worldonlyvvek
 
Waste water treatment processes
Waste water treatment processesWaste water treatment processes
Waste water treatment processesAshish Agarwal
 
Green Storage 1: Economics, Environment, Energy and Engineering
Green Storage 1: Economics, Environment, Energy and EngineeringGreen Storage 1: Economics, Environment, Energy and Engineering
Green Storage 1: Economics, Environment, Energy and Engineering
digitallibrary
 
Agile Product Management Basics
Agile Product Management BasicsAgile Product Management Basics
Agile Product Management BasicsRich Mironov
 
Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoop
inside-BigData.com
 
Improving Utilization of Infrastructure Cloud
Improving Utilization of Infrastructure CloudImproving Utilization of Infrastructure Cloud
Improving Utilization of Infrastructure Cloud
IJASCSE
 
college assignment on Applications of ipsec
college assignment on Applications of ipsec college assignment on Applications of ipsec
college assignment on Applications of ipsec bigchill29
 
Compulsory motor third party liability in Mozambique
Compulsory motor third party liability in MozambiqueCompulsory motor third party liability in Mozambique
Compulsory motor third party liability in Mozambique
Tristan Wiggill
 
Informatica transformation guide
Informatica transformation guideInformatica transformation guide
Informatica transformation guidesonu_pal
 
How to measure illumination
How to measure illuminationHow to measure illumination
How to measure illuminationajsatienza
 
Top 8 print production manager resume samples
Top 8 print production manager resume samplesTop 8 print production manager resume samples
Top 8 print production manager resume sampleskelerdavi
 
Optimized Learning and Development
Optimized Learning and Development Optimized Learning and Development
Optimized Learning and Development AIESEC
 
Ironport Data Loss Prevention
Ironport Data Loss PreventionIronport Data Loss Prevention
Ironport Data Loss Prevention
dkaya
 
6 May 2015 - INCREASING BANKING SALES PRODUCTIVITY - Management Excellence
6 May 2015 - INCREASING BANKING SALES PRODUCTIVITY - Management Excellence6 May 2015 - INCREASING BANKING SALES PRODUCTIVITY - Management Excellence
6 May 2015 - INCREASING BANKING SALES PRODUCTIVITY - Management ExcellenceChange Management Institute
 

Viewers also liked (20)

Securing Your Apache Spark Applications
Securing Your Apache Spark ApplicationsSecuring Your Apache Spark Applications
Securing Your Apache Spark Applications
 
PCRF-Policy Charging System-Functional Analysis
PCRF-Policy Charging System-Functional AnalysisPCRF-Policy Charging System-Functional Analysis
PCRF-Policy Charging System-Functional Analysis
 
Switchyard design overview
Switchyard design overviewSwitchyard design overview
Switchyard design overview
 
Benefits And Applications of PET Plastic Packaging
Benefits And Applications of PET Plastic PackagingBenefits And Applications of PET Plastic Packaging
Benefits And Applications of PET Plastic Packaging
 
1. GRID COMPUTING
1. GRID COMPUTING1. GRID COMPUTING
1. GRID COMPUTING
 
Cross cultural communication in business world
Cross cultural communication in business worldCross cultural communication in business world
Cross cultural communication in business world
 
Waste water treatment processes
Waste water treatment processesWaste water treatment processes
Waste water treatment processes
 
Green Storage 1: Economics, Environment, Energy and Engineering
Green Storage 1: Economics, Environment, Energy and EngineeringGreen Storage 1: Economics, Environment, Energy and Engineering
Green Storage 1: Economics, Environment, Energy and Engineering
 
Agile Product Management Basics
Agile Product Management BasicsAgile Product Management Basics
Agile Product Management Basics
 
Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoop
 
Improving Utilization of Infrastructure Cloud
Improving Utilization of Infrastructure CloudImproving Utilization of Infrastructure Cloud
Improving Utilization of Infrastructure Cloud
 
college assignment on Applications of ipsec
college assignment on Applications of ipsec college assignment on Applications of ipsec
college assignment on Applications of ipsec
 
Basics of print planning
Basics of print planningBasics of print planning
Basics of print planning
 
Compulsory motor third party liability in Mozambique
Compulsory motor third party liability in MozambiqueCompulsory motor third party liability in Mozambique
Compulsory motor third party liability in Mozambique
 
Informatica transformation guide
Informatica transformation guideInformatica transformation guide
Informatica transformation guide
 
How to measure illumination
How to measure illuminationHow to measure illumination
How to measure illumination
 
Top 8 print production manager resume samples
Top 8 print production manager resume samplesTop 8 print production manager resume samples
Top 8 print production manager resume samples
 
Optimized Learning and Development
Optimized Learning and Development Optimized Learning and Development
Optimized Learning and Development
 
Ironport Data Loss Prevention
Ironport Data Loss PreventionIronport Data Loss Prevention
Ironport Data Loss Prevention
 
6 May 2015 - INCREASING BANKING SALES PRODUCTIVITY - Management Excellence
6 May 2015 - INCREASING BANKING SALES PRODUCTIVITY - Management Excellence6 May 2015 - INCREASING BANKING SALES PRODUCTIVITY - Management Excellence
6 May 2015 - INCREASING BANKING SALES PRODUCTIVITY - Management Excellence
 

Similar to Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path for Compute Frameworks

Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
Mike Percy
 
Migrate from Oracle to Aurora PostgreSQL: Best Practices, Design Patterns, & ...
Migrate from Oracle to Aurora PostgreSQL: Best Practices, Design Patterns, & ...Migrate from Oracle to Aurora PostgreSQL: Best Practices, Design Patterns, & ...
Migrate from Oracle to Aurora PostgreSQL: Best Practices, Design Patterns, & ...
Amazon Web Services
 
Architectures, Frameworks and Infrastructure
Architectures, Frameworks and InfrastructureArchitectures, Frameworks and Infrastructure
Architectures, Frameworks and Infrastructureharendra_pathak
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?
Cask Data
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
Jeff Holoman
 
Spark etl
Spark etlSpark etl
Spark etl
Imran Rashid
 
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris Tsirogiannis
Felicia Haggarty
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
Peter Clapham
 
Cloudera Altus: Big Data in der Cloud einfach gemacht
Cloudera Altus: Big Data in der Cloud einfach gemachtCloudera Altus: Big Data in der Cloud einfach gemacht
Cloudera Altus: Big Data in der Cloud einfach gemacht
Cloudera, Inc.
 
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Introducing Apache Kudu (Incubating) - Montreal HUG May 2016
Mladen Kovacevic
 
Big data journey to the cloud 5.30.18 asher bartch
Big data journey to the cloud 5.30.18   asher bartchBig data journey to the cloud 5.30.18   asher bartch
Big data journey to the cloud 5.30.18 asher bartch
Cloudera, Inc.
 
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming dataUsing Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Using Kafka and Kudu for fast, low-latency SQL analytics on streaming data
Mike Percy
 
Spark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike PercySpark Summit EU talk by Mike Percy
Spark Summit EU talk by Mike Percy
Spark Summit
 
Backup multi-cloud solution based on named pipes
Backup multi-cloud solution based on named pipesBackup multi-cloud solution based on named pipes
Backup multi-cloud solution based on named pipes
Leandro Totino Pereira
 
Building a Just-in-Time Application Stack for Analysts
Building a Just-in-Time Application Stack for AnalystsBuilding a Just-in-Time Application Stack for Analysts
Building a Just-in-Time Application Stack for Analysts
Avere Systems
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
huguk
 
Building a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
Swiss Big Data User Group
 
Hpc lunch and learn
Hpc lunch and learnHpc lunch and learn
Hpc lunch and learn
John D Almon
 
Azure from scratch part 3 By Girish Kalamati
Azure from scratch part 3 By Girish KalamatiAzure from scratch part 3 By Girish Kalamati
Azure from scratch part 3 By Girish Kalamati
Girish Kalamati
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Data Con LA
 

Similar to Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path for Compute Frameworks (20)

Intro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application MeetupIntro to Apache Kudu (short) - Big Data Application Meetup
Intro to Apache Kudu (short) - Big Data Application Meetup
 
Migrate from Oracle to Aurora PostgreSQL: Best Practices, Design Patterns, & ...
Migrate from Oracle to Aurora PostgreSQL: Best Practices, Design Patterns, & ...Migrate from Oracle to Aurora PostgreSQL: Best Practices, Design Patterns, & ...
Migrate from Oracle to Aurora PostgreSQL: Best Practices, Design Patterns, & ...
 
Architectures, Frameworks and Infrastructure
Architectures, Frameworks and InfrastructureArchitectures, Frameworks and Infrastructure
Architectures, Frameworks and Infrastructure
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?
 
Introduction to Apache Kudu
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
 
Spark etl
Spark etlSpark etl
Spark etl
 
Impala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris Tsirogiannis
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path for Compute Frameworks

  • 1. Introducing RecordService. Lenni Kuff
  • 2. RecordService is a distributed, scalable data access service for unified authorization in Hadoop.
  • 3. Motivation • As the Hadoop ecosystem expands, new components continue to be added • Speaks to the overall flexibility of Hadoop • This is good: more functionality, more workloads, more use cases. • As use cases for Hadoop mature, user requirements and expectations increase: • Security • Performance • Compatibility • The flexibility of Hadoop has come at the cost of increased complexity
  • 4. Storage Compute
  • 5. Storage Compute …
  • 6. Example: Security Challenge: Provide unified fine-grained security across compute frameworks • Integrating a consistent security layer into every component is not scalable. • Securing data at file level precludes fine-grained access control (column/row) • File ACLs are not enough: a user can view all or nothing. • Currently, must split files and duplicate data, at large operational cost. Solution: Add a level of abstraction, a secure service to access datasets in “record” format • Can now apply fine-grained constraints on projections of a dataset • The same access control policy can be applied uniformly across compute frameworks, decoupled from the underlying storage layer
  • 7. Introducing RecordService
  • 8. RecordService - Overview • Simplifies • Provides a higher-level, logical abstraction for data (i.e. tables or views) • Returns schemed objects (instead of paths and bytes). No need for applications to worry about storage APIs and file formats. • HCatalog? Similar concept, but RecordService is secure and performant. Plan to support HCatalog as a data model on RecordService. • Secures • Central location for all authorization checks using Sentry metadata. • Secure service that does not execute arbitrary user code • Accelerates • Unified data access path allows platform-wide performance improvements.
  • 9. Architecture
  • 10. Architecture • Runs as a distributed service: Planner Servers & Worker Servers • Servers do not store any state • Easy HA, fault tolerance. • Planner Servers are responsible for request planning • Retrieve and combine metadata (NN, HMS, Sentry) • Split generation -> creates tasks for workers • Perform authorization • Worker Servers read from storage and construct records. • IO, file parsing, predicate evaluation • Run as the “source” for a DAG computation
  • 11. Architecture – Server APIs • Planner and Worker services expose Thrift APIs • PlanRequest(), Exec(), Fetch() • PlanRequest() • Accepts SQL to specify the request: supports SELECT and PROJECT • Access to tables and views stored in HMS • Does not run operators that require data exchange; “map only” • Generates a list of tasks which contain the request, each with locality • Exec()/Fetch() • Return records in a canonical, optimized columnar format.
  • 12. Architecture – Fault tolerance • Cluster state persisted in ZK • Membership, delegation tokens, secret keys • Servers do not communicate with each other directly => scalability • Planner services • Expected to run a few (e.g. 3) for HA • Fault tolerance handled by clients getting a list of planners and failing over • Plan requests are short • Worker services • Expected to run on each node in the cluster with data • Fault tolerance handled by the framework (e.g. MR) rescheduling the task
  • 13. Architecture – Security • Authentication using Kerberos and delegation tokens • Planner authorizes the request using metadata in Sentry • Column-level ACLs • Row-level ACLs – create a view with a predicate • Masking – create a view with the masking function in the select list • Tasks generated by the planner are signed with a shared key • Worker runs generated tasks. • Does not authorize, relies on signed tasks • Runs as a user with full access to data, does not run user code
  • 14. Architecture – Security example
        CREATE VIEW v AS
        SELECT mask(credit_card_number) AS ccn, name, balance, region
        FROM data
        WHERE region = 'Europe'
        1. Restrict access to the data set: disable access to the ‘data’ table and underlying files in HDFS.
        2. Give access by creating the view, v.
        3. Set column-level permissions on v per user if necessary.
        Write path (ingest) unchanged. Job expected to run as privileged user.
  • 15. Client APIs – Integration with ecosystem • Similar APIs designed to integrate with MapReduce and Spark • Client APIs make things simpler: clients don’t need to • Interact with HMS • Care about the underlying storage format (the worker always returns records in a canonical format) • Handle storage engine details (e.g. S3)
  • 16. Client Integration APIs • Drop-in replacements for common existing InputFormats • Text, Avro • Can be used with Spark as well • SparkSQL: integration with the Data Sources API • Predicate pushdown, projection • Migration should be easy
  • 17. MR Example
        // Before: read Avro files directly by path
        //FileInputFormat.setInputPaths(job, new Path(args[0]));
        //job.setInputFormatClass(AvroKeyInputFormat.class);
        // After: read the table through RecordService
        RecordServiceConfig.setInputTable(configuration, null, args[0]);
        job.setInputFormatClass(
            com.cloudera.recordservice.avro.mapreduce.AvroKeyInputFormat.class);
  • 18. Spark Example
        // Comment out one or the other
        val file = sc.recordServiceTextFile(path)
        //val file = sc.textFile(path)
  • 19. Spark SQL Example
        ctx.sql(s"""
          |CREATE TEMPORARY TABLE $tbl
          |USING com.cloudera.recordservice.spark.DefaultSource
          |OPTIONS (
          |  RecordServiceTable '$db.$tbl',
          |  RecordServiceTableSize '$size'
          |)
        """.stripMargin)
  • 20. Performance • Shares some core components with Impala • IO management, optimized C++ code, runtime code generation, uses low-level storage APIs • Highly efficient implementation of the scan functionality • Optimized columnar on-wire format • Inspired by Apache Parquet • Accelerates performance for many workloads
  • 21. Terasort • ~Worst-case scenario. Minimal schema: a single STRING column • Custom RecordServiceTeraInputFormat (similar to TeraInputFormat) • 78-node cluster (12 cores/24 Hyper-Threaded, 12 disks) • Ran at 1 billion, 50 billion and 1 trillion (~100TB) scales • See the GitHub repo for more details and runnable examples.
  • 22. TeraChecksum [Figure: bar chart of normalized job time, without RecordService vs. with RecordService. Categories in order: 1B (MapReduce), 50B (MapReduce), 1T (MapReduce), 1B (Spark), 50B (Spark), 1T (Spark). Data labels as extracted, in order: 1, 0.48, 0.23, 1.03, 0.8, 0.85.]
  • 23. Spark SQL • Represents a more expected use case • Data is fully schemed • TPCDS • 500GB scale factor, on Parquet • Cluster • 5-node cluster
  • 24. Spark SQL [Figure: bar chart for TPCDS, SparkSQL vs. SparkSQL with RecordService.] ~15% improvement in query times; queries are not scan bound
  • 25. Spark SQL [Figure: bar chart, SparkSQL vs. SparkSQL with RecordService. Categories: 2% Selective Scan, Sum(col). Data labels as extracted, in order: 29.5, 31, 14, 23.5.]
  • 26. State of the project • Available in v0.2 beta: • Integration with Spark, MR, Pig (via HCatalog) • Planner HA • Apache 2.0 Licensed • Sentry Column-Level Privilege Support • Mini Roadmap: • Improved multi-tenancy • Complex types • More InputFormat support / integration options • Intend to donate to Apache Software Foundation
  • 27. Conclusion • RecordService provides a schemed data access service for Hadoop • Logical data access instead of physical • Much more powerful abstraction • Demonstrated security enforcement, improved performance • Simpler: clients don’t need to worry about low-level details: storage APIs, file formats • Opens the door for future improvements
  • 28. Contributing! • Mailing list: recordservice-user@googlegroups.com • Discussion forum: http://community.cloudera.com/t5/Beta-Releases/bd-p/Beta • Contributions: http://github.com/cloudera/RecordServiceClient/ • Documentation: http://cloudera.github.io/RecordServiceClient/ • Bug Reporting: https://issues.cloudera.org/projects/RS • Beta Download: http://www.cloudera.com/downloads/beta/record-service/0-2-0.html
  • 29. Thank you
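
The security slide says that tasks generated by the planner are signed with a shared key, and that the worker does not authorize but relies on the signed tasks. The slides do not say how the signing is done, so the following is only a minimal sketch of the idea, assuming an HMAC-SHA256 signature over a serialized task string. The class name, method names, and task encoding are hypothetical and are not taken from the RecordService code base.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;

// Hypothetical sketch: these names are illustrative, not RecordService APIs.
class SignedTaskSketch {
    // Assumed algorithm; the slides only say "signed with a shared key".
    private static final String ALGORITHM = "HmacSHA256";

    // Planner side: sign the serialized task with the shared key.
    static String sign(byte[] sharedKey, String task) throws GeneralSecurityException {
        Mac mac = Mac.getInstance(ALGORITHM);
        mac.init(new SecretKeySpec(sharedKey, ALGORITHM));
        byte[] tag = mac.doFinal(task.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(tag);
    }

    // Worker side: recompute the signature and compare in constant time.
    // The worker makes no authorization decision of its own.
    static boolean verify(byte[] sharedKey, String task, String signature)
            throws GeneralSecurityException {
        byte[] expected = Base64.getDecoder().decode(sign(sharedKey, task));
        byte[] actual;
        try {
            actual = Base64.getDecoder().decode(signature);
        } catch (IllegalArgumentException e) {
            return false; // not valid Base64, so it cannot be a valid signature
        }
        return MessageDigest.isEqual(expected, actual);
    }

    public static void main(String[] args) throws GeneralSecurityException {
        byte[] key = "demo-shared-key".getBytes(StandardCharsets.UTF_8);
        String task = "SELECT name, balance FROM v /* split 0 */";
        String signature = sign(key, task);

        // An untouched task verifies; a task altered after planning does not.
        System.out.println(verify(key, task, signature));
        System.out.println(verify(key, task + " OR 1=1", signature));
    }
}
```

In this sketch the worker only needs the shared key (which the fault tolerance slide says is kept in ZK along with other cluster state) to reject a task that was modified after the planner authorized it.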

Editor's Notes

  1. In this talk we will be introducing RecordService. In short, RecordService is a highly scalable, distributed data access service for Hadoop that provides unified authorization while also simplifying the platform.
  2. Before digging into the details of RecordService, let’s take a step back and look at the current state of the Hadoop ecosystem. What we have seen is more components continuing to be added to the stack at an accelerating rate.
  3. RS provides a layer of abstraction over storage, so compute frameworks don’t need to care where data is stored. It provides a platform for uniform, fine-grained security across all compute engines. It helps to simplify Hadoop by offering a unified data access path.
  4. Single place for performance enhancements
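
The fine-grained security described in the notes is illustrated in the deck by a view that filters rows by region and masks the credit card column. As a rough in-memory analogy of what a reader of that view would see, here is a small sketch. Everything in it is an assumption made for illustration: the `Row` class, the sample data, and in particular the behavior of `mask`, which the slides do not define (here it hides all but the last four characters).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory analogy of the view v from the security example slide.
class ViewSketch {
    // One row of the 'data' table, with the columns named on the slide.
    static final class Row {
        final String creditCardNumber;
        final String name;
        final double balance;
        final String region;

        Row(String creditCardNumber, String name, double balance, String region) {
            this.creditCardNumber = creditCardNumber;
            this.name = name;
            this.balance = balance;
            this.region = region;
        }
    }

    // Assumed masking rule: keep the last four characters, replace the rest with '*'.
    static String mask(String value) {
        int keep = Math.min(4, value.length());
        StringBuilder masked = new StringBuilder();
        for (int i = 0; i < value.length() - keep; i++) {
            masked.append('*');
        }
        return masked.append(value.substring(value.length() - keep)).toString();
    }

    // Equivalent of the view: row filter on region plus a masked column.
    static List<Row> viewV(List<Row> data) {
        List<Row> visible = new ArrayList<>();
        for (Row row : data) {
            if ("Europe".equals(row.region)) {
                visible.add(new Row(mask(row.creditCardNumber), row.name, row.balance, row.region));
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<Row> data = Arrays.asList(
            new Row("4111111111111111", "Ada", 120.50, "Europe"),
            new Row("5500000000000004", "Bob", 75.00, "Asia"));
        for (Row row : viewV(data)) {
            System.out.println(row.creditCardNumber + " " + row.name + " " + row.balance + " " + row.region);
        }
    }
}
```

Because the filter and the mask live in the view definition rather than in each client, any framework reading through the service would get the same restricted result, which is the uniformity the notes point to.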