SlideShare a Scribd company logo
1 of 36
Download to read offline
Mastering Data with Spark
and ML
Strata London 2019
About Me
IIT Delhi, 1998
Founder and CEO, Nube Technologies
Strata Data San Jose Program Committee
Speaker at Spark Summit, Strata, GIDS etc
Nube
India based startup
Deep technical problems with an enterprise solution
ML, Big Data, UX
This talk today
Problem Statement
Our Approach
Simple business asks
Customer LTV
Best supplier for a part
Supplier payment terms
Householding
Cross Sell Opportunities
M&A
Actual Data
Actual data - real facts
Silos
Data Quality
Volumes
Challenges
Variety of sources
Scale
Capturing rules for matching and merging for different business entities
Wishlist
Any source and format
Any entity type
Any volume
Reifier
AI powered data management, matching and merging different data sources to
build a holistic view.
- MDM
- Fraud and Analytics
- Sales and Marketing
- Customer AML/KYC/cross and Upsell
- Data Enrichment
- Reference data Management
- Data Quality
Our stack
Wishlist
Any source and format
Any entity type
Any scale
Any source, Any format
Based on RDDs
RDD = fault-tolerant collection of elements that can be operated on in parallel.
Custom source and sink formats written by us/borrowed from community
Any source, Any format
Elastic:
import org.elasticsearch.spark.rdd.EsSpark
EsSpark.saveToEs(rdd, "spark/docs")
Cassandra:
val rdd = sc.cassandraTable("test", "kv")
rdd.saveToCassandra("test", "kv", SomeColumns("key", "value"))
Problems with RDDs
Record wise reading was good, but adding structure to the data was left to us.
reifier.Tuple - indexed data structure
Development and maintenance nightmare
Reifier 2.0
- Datasets
- Pipe abstraction
public class Pipe implements Serializable{
String name;
Format format;
Map<String, String> props;
StructType schema;
}
Building Dataset through Pipe
DataFrameReader reader = spark.read();
reader = reader.format(p.getFormat().type());
reader = reader.schema(p.getSchema());
for (String key: p.getProps().keySet()) {
reader = reader.option(key, p.get(key));
}
Dataset<Row> input = reader.load();
A word of caution
Abstraction is good, but
Wishlist
Any source and format
Any entity type
Any scale
Any entity type
Traditional rule based system fails
AI to the rescue
Also Cassandra
Reifier Interactive Learner
Reifier Interactive Learner
Wishlist
Any source and format
Any entity type
Any scale
Any scale
Add Spark to the mix
Ouch, cartesian join - 1 million records = Order of a trillion comparisons
Learn what to join
AutoML
Build multiple models based on the training data
Optimize for accuracy and performance
Use Spark to train and assess different models
Spark Integration
Livy
Spark Job Server
Spark Integration
Two ways in which we integrate.
-Local SparkContext.
-Programmatically through hidden Spark REST service
Cassandra
Any Entity
Any Scale
Cassandra Training
Primary Key - Cluster Id, Record Id
Secondary Index - r_isMatch
Cassandra Entity
Primary Key - Record Id
Secondary Index - Cluster Id
Elastic
View over Cassandra
Free flowing search
Ad hoc analytics
Realtime Plugin
Wishlist
Any source and format
Any entity type
Any scale
Thank You!
www.nubetech.co
sonal@nubetech.co

More Related Content

What's hot

BigData-Architecture
BigData-ArchitectureBigData-Architecture
BigData-Architecture
Narayana B
 

What's hot (20)

Overview of Oracle Database 18c Express Edition (XE)
Overview of Oracle Database 18c Express Edition (XE)Overview of Oracle Database 18c Express Edition (XE)
Overview of Oracle Database 18c Express Edition (XE)
 
Solution architecture
Solution architectureSolution architecture
Solution architecture
 
Building Knowledge Graphs in 10 steps
Building Knowledge Graphs in 10 stepsBuilding Knowledge Graphs in 10 steps
Building Knowledge Graphs in 10 steps
 
Solution architecture for big data projects
Solution architecture for big data projectsSolution architecture for big data projects
Solution architecture for big data projects
 
The Bounties of Semantic Data Integration for the Enterprise
The Bounties of Semantic Data Integration for the Enterprise The Bounties of Semantic Data Integration for the Enterprise
The Bounties of Semantic Data Integration for the Enterprise
 
Connected data meetup group - introduction & scope
Connected data meetup group - introduction & scopeConnected data meetup group - introduction & scope
Connected data meetup group - introduction & scope
 
SFScon19 - Grazia Cazzin - KNOWAGE the open source answer to the new needs in...
SFScon19 - Grazia Cazzin - KNOWAGE the open source answer to the new needs in...SFScon19 - Grazia Cazzin - KNOWAGE the open source answer to the new needs in...
SFScon19 - Grazia Cazzin - KNOWAGE the open source answer to the new needs in...
 
Enterprise architecture for big data projects
Enterprise architecture for big data projectsEnterprise architecture for big data projects
Enterprise architecture for big data projects
 
SciDB
SciDBSciDB
SciDB
 
Getting started with Cosmos DB + Linkurious Enterprise
Getting started with Cosmos DB + Linkurious EnterpriseGetting started with Cosmos DB + Linkurious Enterprise
Getting started with Cosmos DB + Linkurious Enterprise
 
It Don’t Mean a Thing If It Ain’t Got Semantics
It Don’t Mean a Thing If It Ain’t Got SemanticsIt Don’t Mean a Thing If It Ain’t Got Semantics
It Don’t Mean a Thing If It Ain’t Got Semantics
 
How to visualize Cosmos DB graph data
How to visualize Cosmos DB graph dataHow to visualize Cosmos DB graph data
How to visualize Cosmos DB graph data
 
BigData-Architecture
BigData-ArchitectureBigData-Architecture
BigData-Architecture
 
Using Knowledge Graphs to Predict Customer Needs and Improve Quality
Using Knowledge Graphs to Predict Customer Needs and Improve QualityUsing Knowledge Graphs to Predict Customer Needs and Improve Quality
Using Knowledge Graphs to Predict Customer Needs and Improve Quality
 
schema.org, Linked Data's Gateway Drug
schema.org, Linked Data's Gateway Drugschema.org, Linked Data's Gateway Drug
schema.org, Linked Data's Gateway Drug
 
A secure and dynamic multi keyword ranked search scheme over encrypted cloud ...
A secure and dynamic multi keyword ranked search scheme over encrypted cloud ...A secure and dynamic multi keyword ranked search scheme over encrypted cloud ...
A secure and dynamic multi keyword ranked search scheme over encrypted cloud ...
 
Building NextGen Enterprise data platforms | Graham Cousins
Building NextGen Enterprise data platforms | Graham CousinsBuilding NextGen Enterprise data platforms | Graham Cousins
Building NextGen Enterprise data platforms | Graham Cousins
 
Artificial Intelligence for Data Quality
Artificial Intelligence for Data QualityArtificial Intelligence for Data Quality
Artificial Intelligence for Data Quality
 
How to migrate to GraphDB in 10 easy to follow steps
How to migrate to GraphDB in 10 easy to follow steps How to migrate to GraphDB in 10 easy to follow steps
How to migrate to GraphDB in 10 easy to follow steps
 
Using the Semantic Web Stack to Make Big Data Smarter
Using the Semantic Web Stack to Make  Big Data SmarterUsing the Semantic Web Stack to Make  Big Data Smarter
Using the Semantic Web Stack to Make Big Data Smarter
 

Similar to Master Data Management Using AI

Data Warehouse Modeling
Data Warehouse ModelingData Warehouse Modeling
Data Warehouse Modeling
vivekjv
 
Ipedo Company Overview
Ipedo Company OverviewIpedo Company Overview
Ipedo Company Overview
Tim_Matthews
 
Data Lakes and Analytics Dow Jones - AWS FS Cloud Symposium Apr 2019.pdf
Data Lakes and Analytics Dow Jones - AWS FS Cloud Symposium Apr 2019.pdfData Lakes and Analytics Dow Jones - AWS FS Cloud Symposium Apr 2019.pdf
Data Lakes and Analytics Dow Jones - AWS FS Cloud Symposium Apr 2019.pdf
Amazon Web Services
 

Similar to Master Data Management Using AI (20)

How to govern and secure a Data Mesh?
How to govern and secure a Data Mesh?How to govern and secure a Data Mesh?
How to govern and secure a Data Mesh?
 
Data Warehouse Modeling
Data Warehouse ModelingData Warehouse Modeling
Data Warehouse Modeling
 
Big Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptxBig Data Transformation Powered By Apache Spark.pptx
Big Data Transformation Powered By Apache Spark.pptx
 
Big Data Transformations Powered By Spark
Big Data Transformations Powered By SparkBig Data Transformations Powered By Spark
Big Data Transformations Powered By Spark
 
Ipedo Company Overview
Ipedo Company OverviewIpedo Company Overview
Ipedo Company Overview
 
Accelerating Insight - Smart Data Lake Customer Success Stories
Accelerating Insight - Smart Data Lake Customer Success StoriesAccelerating Insight - Smart Data Lake Customer Success Stories
Accelerating Insight - Smart Data Lake Customer Success Stories
 
Freddie Mac & KPMG Case Study – Advanced Machine Learning Data Integration wi...
Freddie Mac & KPMG Case Study – Advanced Machine Learning Data Integration wi...Freddie Mac & KPMG Case Study – Advanced Machine Learning Data Integration wi...
Freddie Mac & KPMG Case Study – Advanced Machine Learning Data Integration wi...
 
Bringing the Power of Big Data Computation to Salesforce
Bringing the Power of Big Data Computation to SalesforceBringing the Power of Big Data Computation to Salesforce
Bringing the Power of Big Data Computation to Salesforce
 
Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big Data
 
Logical Data Fabric: Architectural Components
Logical Data Fabric: Architectural ComponentsLogical Data Fabric: Architectural Components
Logical Data Fabric: Architectural Components
 
ACdP Fiware.pdf
ACdP Fiware.pdfACdP Fiware.pdf
ACdP Fiware.pdf
 
Kaizentric Presentation
Kaizentric PresentationKaizentric Presentation
Kaizentric Presentation
 
Machine learning with Spark
Machine learning with SparkMachine learning with Spark
Machine learning with Spark
 
Master Meta Data
Master Meta DataMaster Meta Data
Master Meta Data
 
黑豹 ch4 ddd pattern practice (2)
黑豹 ch4 ddd pattern practice (2)黑豹 ch4 ddd pattern practice (2)
黑豹 ch4 ddd pattern practice (2)
 
ING- CoreIntel- Collect and Process Network Logs Across Data Centers in Real ...
ING- CoreIntel- Collect and Process Network Logs Across Data Centers in Real ...ING- CoreIntel- Collect and Process Network Logs Across Data Centers in Real ...
ING- CoreIntel- Collect and Process Network Logs Across Data Centers in Real ...
 
Unleashing the power of apache atlas with apache - virtual dataconnector
Unleashing the power of apache atlas with apache  - virtual dataconnectorUnleashing the power of apache atlas with apache  - virtual dataconnector
Unleashing the power of apache atlas with apache - virtual dataconnector
 
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
SparkR: The Past, the Present and the Future-(Shivaram Venkataraman and Rui S...
 
Data Lakes and Analytics Dow Jones - AWS FS Cloud Symposium Apr 2019.pdf
Data Lakes and Analytics Dow Jones - AWS FS Cloud Symposium Apr 2019.pdfData Lakes and Analytics Dow Jones - AWS FS Cloud Symposium Apr 2019.pdf
Data Lakes and Analytics Dow Jones - AWS FS Cloud Symposium Apr 2019.pdf
 
The Power of Data
The Power of DataThe Power of Data
The Power of Data
 

Recently uploaded

Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
HyderabadDolls
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
HyderabadDolls
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
gajnagarg
 

Recently uploaded (20)

SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
Diamond Harbour \ Russian Call Girls Kolkata | Book 8005736733 Extreme Naught...
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Oral Sex Call Girls Kashmiri Gate Delhi Just Call 👉👉 📞 8448380779 Top Class C...
Oral Sex Call Girls Kashmiri Gate Delhi Just Call 👉👉 📞 8448380779 Top Class C...Oral Sex Call Girls Kashmiri Gate Delhi Just Call 👉👉 📞 8448380779 Top Class C...
Oral Sex Call Girls Kashmiri Gate Delhi Just Call 👉👉 📞 8448380779 Top Class C...
 
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
💞 Safe And Secure Call Girls Agra Call Girls Service Just Call 🍑👄6378878445 🍑...
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime GiridihGiridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
Giridih Escorts Service Girl ^ 9332606886, WhatsApp Anytime Giridih
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 

Master Data Management Using AI