Multi Source Data Analysis
Using Apache Spark and Tellius
https://github.com/phatak-dev/spark2.0-examples
● Madhukara Phatak
● Director of Engineering, Tellius
● Works on Hadoop, Spark, ML and Scala
● www.madhukaraphatak.com
Agenda
● Multi Source Data
● Challenges with Multi Source
● Traditional and Data Lake Approach
● Spark Approach
● Data Source and Data Frame API
● Tellius Platform
● Multi Source analysis in Tellius
Multi Source Data
● In the era of cloud computing and big data, data for
analysis can come from various sources
● In most organizations, it has become common to store
data across a wide variety of storage systems
● The nature of the data varies from source to source
● Data can be structured, semi-structured, or fully
unstructured
Multi Source Example in Ecommerce
● Relational databases are used to hold product details
and customer transactions
● Big data warehousing tools like Hadoop/Hive/Impala are
used to store historical transactions and ratings for
analytics
● Google Analytics stores the website analytics data
● Log data lives in S3 / Azure Blob storage
● Every storage system is optimized to store a specific
type of data
Multi Source Data Analysis
Need of Multi Source Analysis
● If the analysis of the data is restricted to only one
source, then we may lose sight of interesting patterns in
our business
● A complete, 360-degree view of the business is not
possible unless we consider all the data that is
available to us
● Advanced analytics such as ML or AI are more useful
when there is more variety in the data
Traditional Approach
● The traditional way of doing multi source analysis
required all data to be moved to a single data source
● This approach made sense when the number of sources
was small and the data was well structured
● With an increasing number of sources, ETL takes
longer
● Normalizing the data to a common schema becomes
challenging for semi-structured sources
● Traditional databases also cannot hold data at this
volume
Data Lake Approach
● Move the data from the different sources to a big data
enabled repository
● It solves the problem of volume, but there are still
challenges with it
● All the rich schema information in the source may not
translate well to the data lake repository
● ETL time will still be significant
● The underlying source's processing capabilities cannot
be used
● Not good for exploratory analysis
Apache Spark Approach
Requirements
● Ability to load data uniformly from different sources
irrespective of their type
● Ability to represent the data in a single format
irrespective of its source
● Ability to combine data from the sources naturally
● Ability to query data across the sources naturally
● Ability to use the underlying source's processing
whenever possible
Apache Spark Approach
● Data Source API of Spark SQL allows the user to load
data uniformly from a wide variety of sources
● DataFrame / Dataset API of Spark allows the user to
represent data from all sources uniformly
● Spark SQL can join data from different sources
● Spark SQL pushes down filters and prunes columns if
the underlying source supports it
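A toy illustration of the last bullet, written in plain Python rather than Spark: the standard-library sqlite3 module stands in for a source that can evaluate filters itself. The table, columns, and values are made up for this sketch.

```python
import sqlite3

# In-memory SQLite database standing in for a relational source (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customerid INTEGER, city TEXT, cost REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, "Bangalore", 10.0), (2, "Chennai", 20.0), (3, "Bangalore", 30.0)],
)

# Without pushdown: pull every row and column, then filter in the engine.
all_rows = conn.execute("SELECT * FROM transactions ORDER BY customerid").fetchall()
engine_side = [(row[0], row[2]) for row in all_rows if row[1] == "Bangalore"]

# With pushdown and column pruning: the source applies the filter and
# returns only the two columns the query needs.
source_side = conn.execute(
    "SELECT customerid, cost FROM transactions WHERE city = ? ORDER BY customerid",
    ("Bangalore",),
).fetchall()

print(len(all_rows), engine_side, source_side)
```

Both paths give the same answer; the second one moves less data out of the source.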
Customer 360 Use Case
Customer 360
● Four different datasets from two different sources
● We will be using flat file and MySQL data sources
● Demographics - Primarily focuses on customer information like
age, gender, location etc. (MySQL)
● Transactions - Cost of product, purchase date, store id, store
type, brands, retail department, retail cost (MySQL)
● Credit Information – Reward Member, Redemption Method
● Marketing Information - Ad source, Promotional code
Loading Data
● We are going to use the csv and jdbc connectors for
Spark to load the data
● Thanks to automatic schema inference, the data frames
get the schema we need
● After that we preview the data using the show method
● Ex : MultiSourceLoad
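The loading idea can be sketched without Spark, using only the Python standard library: a CSV text and a relational table are both turned into the same row format (a list of dicts), with a naive type inference for the CSV. This is not the MultiSourceLoad example from the repository; sqlite3 stands in for MySQL, and all table names, columns, and values are assumptions made for illustration.

```python
import csv
import io
import sqlite3


def infer(value):
    """Naive schema inference: try int, then float, else keep the string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def load_csv(text):
    """Load CSV text into a list of dict rows with inferred types."""
    reader = csv.DictReader(io.StringIO(text))
    return [{key: infer(val) for key, val in row.items()} for row in reader]


def load_table(conn, table):
    """Load a database table into the same list-of-dicts representation."""
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [desc[0] for desc in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Hypothetical flat-file source.
credit_csv = "customerid,reward_member,redemption_method\n1,yes,points\n2,no,cash\n"

# Hypothetical relational source (SQLite in place of MySQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demographics (customerid INTEGER, age INTEGER, gender TEXT)")
conn.executemany("INSERT INTO demographics VALUES (?, ?, ?)", [(1, 34, "F"), (2, 41, "M")])

credit = load_csv(credit_csv)
demographics = load_table(conn, "demographics")

# Preview the first row of each dataset.
print(credit[0])
print(demographics[0])
```

After loading, code downstream no longer needs to know which source a row came from.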
Multi Source Data Model
● We can define a data model using Spark joins
● Here we join the 4 datasets on customerid as the
common key
● After an inner join, we get a data model which has
all the sources combined
● Ex : MultiSourceDataModel
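A minimal plain-Python sketch of the same idea, not the MultiSourceDataModel example itself: four small datasets are inner-joined on customerid into one flat model. The rows and column names other than customerid are invented for illustration.

```python
from functools import reduce


def inner_join(left, right, key):
    """Inner join two lists of dict rows on a shared key column."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical datasets; customer 3 appears only in demographics.
demographics = [
    {"customerid": 1, "city": "Bangalore"},
    {"customerid": 2, "city": "Chennai"},
    {"customerid": 3, "city": "Mysore"},
]
transactions = [{"customerid": 1, "cost": 10.0}, {"customerid": 2, "cost": 20.0}]
credit = [{"customerid": 1, "reward_member": "yes"}, {"customerid": 2, "reward_member": "no"}]
marketing = [{"customerid": 1, "ad_source": "InstagramAds"}, {"customerid": 2, "ad_source": "Email"}]

# Fold the joins left to right to build the flat data model.
model = reduce(
    lambda acc, dataset: inner_join(acc, dataset, "customerid"),
    [transactions, credit, marketing],
    demographics,
)

for row in model:
    print(row)
```

Because the join is an inner join, customer 3 drops out of the model: it has no matching rows in the other three datasets.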
Multi Source Analysis
● Show us the sales by different sources
● Average Cost and Sum Revenue by City and
Department
● Revenue by Campaign
● Ex : MultiSourceDataAnalysis
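The queries above are group-by aggregations over the joined model. A small plain-Python sketch of two of them follows; it is not the MultiSourceDataAnalysis example, and the rows and column names are assumptions.

```python
from collections import defaultdict

# Hypothetical flat model rows, as they might look after the join.
model = [
    {"customerid": 1, "city": "Bangalore", "department": "Toys", "cost": 10.0, "revenue": 15.0, "ad_source": "InstagramAds"},
    {"customerid": 2, "city": "Bangalore", "department": "Toys", "cost": 20.0, "revenue": 25.0, "ad_source": "Email"},
    {"customerid": 3, "city": "Chennai", "department": "Books", "cost": 5.0, "revenue": 8.0, "ad_source": "InstagramAds"},
]


def aggregate(rows, keys, measure, how):
    """Group rows by the given key columns and aggregate one measure."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row[measure])
    if how == "sum":
        return {k: sum(v) for k, v in groups.items()}
    if how == "avg":
        return {k: sum(v) / len(v) for k, v in groups.items()}
    raise ValueError(f"unsupported aggregation: {how}")


# Revenue by campaign (ad source).
revenue_by_source = aggregate(model, ["ad_source"], "revenue", "sum")

# Average cost by city and department.
avg_cost = aggregate(model, ["city", "department"], "cost", "avg")

print(revenue_by_source)
print(avg_cost)
```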
Introduction to Tellius
About Tellius
Search and AI-powered analytics platform,
enabling anyone to get answers from their business data
using an intuitive search-driven interface and automatically
uncover hidden insights with machine learning
SMART INTUITIVE PERSONALIZED
Customers expect an ON-DEMAND, personalized experience
We live in the era of intelligent consumer apps
Low analytics adoption: takes days/weeks to get
answers to ad-hoc questions
Analysis process not scalable: time-consuming manual process of
analyzing millions of combinations and charts
Trust with AI for business outcomes: no easy way for business users
and analysts to understand, trust and leverage ML/AI techniques
So much business data, but very few insights
Tellius is disrupting data analytics with AI
Combining modern search driven user experience with
AI-driven automation to find hidden answers
Tellius Modern Analytics experience
Get Instant answers
Start exploring
Reduce your analysis time from
Hours to Mins
Explainable AI for business
analysts
Time consuming,
Canned reports and dashboards
On-Demand,
Personalized experience
Self-service data prep
Scalable In-Memory Data Platform
Search-driven
Conversational Analytics
Automated discovery
Of insights
Automated Machine
Learning
Only AI Platform that enables collaboration between roles
DATA MANAGEMENT
Visual Data prep with
SQL/ Python support
VISUAL ANALYSIS
Voice Enabled Search Driven
Interface for Asking Questions
Business User
Data Science
Practitioner
Data Analyst
Data Engineer
DISCOVERY OF INSIGHTS
Augmented discovery of insights
With natural language narrative
MACHINE LEARNING
AutoML and deployment of
ML models with Explainable AI
Google-like Search
driven Conversational
interface
Reveals hidden
relevant insights
saving 1000’s of hours
Eliminating friction
between self service
data prep to
ad-hoc analysis
and explainable
ML models
In-memory
architecture capable
of handling
billions of records
Intuitive UX AI-Driven Automation
Unified Analytics
Experience
Scalable Architecture
Why Tellius?
Only company providing instant Natural language Search experience, surfacing
AI-driven relevant insights across billions of records across data sources at scale and
enabling users to easily create and explain ML/AI models
Business Value Proposition
Automate discovery of relevant
hidden Insights
in your data
Ease of Use Uncover Hidden Insights
Get instant answers with
conversational Search
driven approach
Save Time
Augment Manual discovery process
with automation powered by Machine
learning
Our Vision- Accelerate journey to AI driven Enterprise
CONNECT EXPLORE DISCOVER PREDICT
Customer 360 on Tellius
Loading Data
● Tellius exposes various kinds of data sources to connect to
using the Spark data source API
● In this use case, we will use the MySQL and csv
connectors to load the data into the system
● Tellius collects metadata about the data as part of
loading
● Some connectors, like Salesforce and Google
Analytics, are homegrown using the same data source API
Defining Data Model
● Tellius calls data models business views
● Business views allow the user to create a data model
across datasets seamlessly
● Internally, all datasets in Tellius are represented as
Spark DataFrames
● Defining a business view in Tellius is like defining a
join in Spark SQL
Multi Source analysis using NLP
● Which top 6 sources by avg revenue
● Hey Tellius what’s my revenue broken down by
department
● show revenue by cit
● show revenue by department for InstagramAds
● These ultimately run as Spark queries and produce
the results
● We can also use voice
Multi Source analysis using Assistant
● Show total revenue
● By city
● What about cost
● for InstagramAds
● Use Voice
● Try out Google Home
Challenges
Spark DataModel
● A Spark join creates a flat data model, which is different
from a typical data warehouse data model
● This flat data model works well when there is no
duplication of primary keys, aka star model
● But if there is duplication, we end up double counting
values when we run queries directly
● Example : DoubleCounting
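The effect is easy to reproduce with a tiny SQL example. The sketch below uses the standard-library sqlite3 module instead of Spark, and it is not the DoubleCounting example from the repository. The tables and values are invented: customer 1 has two rows in credit, so the join repeats that customer's revenue.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE transactions (customerid INTEGER, revenue REAL);
    CREATE TABLE credit (customerid INTEGER, redemption_method TEXT);
    INSERT INTO transactions VALUES (1, 100.0), (2, 50.0);
    -- customerid 1 is duplicated on this side of the join
    INSERT INTO credit VALUES (1, 'online'), (1, 'instore'), (2, 'online');
    """
)

# Total revenue computed on the source table alone.
true_total = conn.execute("SELECT SUM(revenue) FROM transactions").fetchone()[0]

# Total revenue computed on the flat, joined model.
flat_total = conn.execute(
    """
    SELECT SUM(t.revenue)
    FROM transactions t
    JOIN credit c ON t.customerid = c.customerid
    """
).fetchone()[0]

print(true_total, flat_total)
```

The joined model reports 250.0 instead of 150.0 because the 100.0 for customer 1 is counted once per matching credit row.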
Handling Double Counting in Tellius
● Tellius has implemented its own query language on top
of the Spark SQL layer to implement data warehouse
like strategies to avoid this double counting
● This layer allows Tellius to provide multi source analysis
on top of Spark with the accuracy of a data warehouse system
● Ex : show point_redemeption_method
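The slides do not describe how the Tellius layer works internally. As an illustration only, here is one common warehouse-style strategy: aggregate the measure at its own grain and use the other table just to decide which keys qualify, instead of summing over the joined rows. It uses sqlite3 from the Python standard library, with invented tables and values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE transactions (customerid INTEGER, revenue REAL);
    CREATE TABLE credit (customerid INTEGER, redemption_method TEXT);
    INSERT INTO transactions VALUES (1, 100.0), (2, 50.0);
    INSERT INTO credit VALUES (1, 'online'), (1, 'instore'), (2, 'online');
    """
)

# Naive: sum over the joined rows, which repeats customer 1's revenue.
naive_total = conn.execute(
    """
    SELECT SUM(t.revenue)
    FROM transactions t
    JOIN credit c ON t.customerid = c.customerid
    """
).fetchone()[0]

# Safer: sum the measure on its own table; the other table only
# filters which customers count (a semi-join), so no row is repeated.
safe_total = conn.execute(
    """
    SELECT SUM(revenue)
    FROM transactions
    WHERE customerid IN (SELECT customerid FROM credit)
    """
).fetchone()[0]

print(naive_total, safe_total)
```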
References
● Dataset API -
https://www.youtube.com/watch?v=hHFuKeeQujc
● Structured Data Analysis -
https://www.youtube.com/watch?v=0jd3EWmKQfo
● Anatomy of Spark SQL -
https://www.youtube.com/watch?v=TCWOJ6EJprY
We are Hiring!!!
Thank You

More Related Content

What's hot

Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQL
datamantra
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
datamantra
 
Improving Mobile Payments With Real time Spark
Improving Mobile Payments With Real time SparkImproving Mobile Payments With Real time Spark
Improving Mobile Payments With Real time Spark
datamantra
 
Anatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source APIAnatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source API
datamantra
 
Productionalizing a spark application
Productionalizing a spark applicationProductionalizing a spark application
Productionalizing a spark application
datamantra
 
A Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache SparkA Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache Spark
datamantra
 
Evolution of apache spark
Evolution of apache sparkEvolution of apache spark
Evolution of apache spark
datamantra
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
datamantra
 
Introduction to dataset
Introduction to datasetIntroduction to dataset
Introduction to dataset
datamantra
 
Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0
datamantra
 
Introduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actorsIntroduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actors
Shashank L
 
Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2
datamantra
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0
datamantra
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafka
datamantra
 
Introduction to Flink Streaming
Introduction to Flink StreamingIntroduction to Flink Streaming
Introduction to Flink Streaming
datamantra
 
Anatomy of in memory processing in Spark
Anatomy of in memory processing in SparkAnatomy of in memory processing in Spark
Anatomy of in memory processing in Spark
datamantra
 
Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1
datamantra
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
datamantra
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
datamantra
 
Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)
Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)
Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)
Databricks
 

What's hot (20)

Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQL
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2
 
Improving Mobile Payments With Real time Spark
Improving Mobile Payments With Real time SparkImproving Mobile Payments With Real time Spark
Improving Mobile Payments With Real time Spark
 
Anatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source APIAnatomy of Data Source API : A deep dive into Spark Data source API
Anatomy of Data Source API : A deep dive into Spark Data source API
 
Productionalizing a spark application
Productionalizing a spark applicationProductionalizing a spark application
Productionalizing a spark application
 
A Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache SparkA Tool For Big Data Analysis using Apache Spark
A Tool For Big Data Analysis using Apache Spark
 
Evolution of apache spark
Evolution of apache sparkEvolution of apache spark
Evolution of apache spark
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
 
Introduction to dataset
Introduction to datasetIntroduction to dataset
Introduction to dataset
 
Introduction to spark 2.0
Introduction to spark 2.0Introduction to spark 2.0
Introduction to spark 2.0
 
Introduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actorsIntroduction to concurrent programming with Akka actors
Introduction to concurrent programming with Akka actors
 
Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2Building distributed processing system from scratch - Part 2
Building distributed processing system from scratch - Part 2
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafka
 
Introduction to Flink Streaming
Introduction to Flink StreamingIntroduction to Flink Streaming
Introduction to Flink Streaming
 
Anatomy of in memory processing in Spark
Anatomy of in memory processing in SparkAnatomy of in memory processing in Spark
Anatomy of in memory processing in Spark
 
Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1Building Distributed Systems from Scratch - Part 1
Building Distributed Systems from Scratch - Part 1
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Real time ETL processing using Spark streaming
Real time ETL processing using Spark streamingReal time ETL processing using Spark streaming
Real time ETL processing using Spark streaming
 
Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)
Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)
Extending Spark SQL 2.4 with New Data Sources (Live Coding Session)
 

Similar to Multi Source Data Analysis using Spark and Tellius

Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
Rajesh Muppalla
 
Turn Data into Business Value – Starting with Data Analytics on Oracle Cloud ...
Turn Data into Business Value – Starting with Data Analytics on Oracle Cloud ...Turn Data into Business Value – Starting with Data Analytics on Oracle Cloud ...
Turn Data into Business Value – Starting with Data Analytics on Oracle Cloud ...
Lucas Jellema
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabs
zekeLabs Technologies
 
Delivering a Linked Data warehouse and realising the power of graphs
Delivering a Linked Data warehouse and realising the power of graphsDelivering a Linked Data warehouse and realising the power of graphs
Delivering a Linked Data warehouse and realising the power of graphs
Ben Gardner
 
Splice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowSplice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflow
Databricks
 
Ben Gardner | Delivering a Linked Data warehouse and integrating across the w...
Ben Gardner | Delivering a Linked Data warehouse and integrating across the w...Ben Gardner | Delivering a Linked Data warehouse and integrating across the w...
Ben Gardner | Delivering a Linked Data warehouse and integrating across the w...
semanticsconference
 
Enterprise Data Warehousing Positioning
Enterprise Data Warehousing PositioningEnterprise Data Warehousing Positioning
Enterprise Data Warehousing Positioning
EdenH6
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine Learning
Provectus
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft Azure
Mark Kromer
 
Simply Business' Data Platform
Simply Business' Data PlatformSimply Business' Data Platform
Simply Business' Data Platform
Dani Solà Lagares
 
Google Cloud Machine Learning
 Google Cloud Machine Learning  Google Cloud Machine Learning
Google Cloud Machine Learning
India Quotient
 
From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...
From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...
From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...
Mark Rittman
 
DevOps Spain 2019. Olivier Perard-Oracle
DevOps Spain 2019. Olivier Perard-OracleDevOps Spain 2019. Olivier Perard-Oracle
DevOps Spain 2019. Olivier Perard-Oracle
atSistemas
 
Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja...
Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja...Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja...
Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja...
Data Con LA
 
How Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science StackHow Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science Stack
Denodo
 
Microsoft Fabric Introduction
Microsoft Fabric IntroductionMicrosoft Fabric Introduction
Microsoft Fabric Introduction
James Serra
 
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Pentaho
 
Introduction to Machine Learning - WeCloudData
Introduction to Machine Learning - WeCloudDataIntroduction to Machine Learning - WeCloudData
Introduction to Machine Learning - WeCloudData
WeCloudData
 
Introduction to Machine Learning - WeCloudData
Introduction to Machine Learning - WeCloudDataIntroduction to Machine Learning - WeCloudData
Introduction to Machine Learning - WeCloudData
WeCloudData
 
Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...
Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...
Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...
Daniel Zivkovic
 

Similar to Multi Source Data Analysis using Spark and Tellius (20)

Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
Turn Data into Business Value – Starting with Data Analytics on Oracle Cloud ...
Turn Data into Business Value – Starting with Data Analytics on Oracle Cloud ...Turn Data into Business Value – Starting with Data Analytics on Oracle Cloud ...
Turn Data into Business Value – Starting with Data Analytics on Oracle Cloud ...
 
Machine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabsMachine learning at scale - Webinar By zekeLabs
Machine learning at scale - Webinar By zekeLabs
 
Delivering a Linked Data warehouse and realising the power of graphs
Delivering a Linked Data warehouse and realising the power of graphsDelivering a Linked Data warehouse and realising the power of graphs
Delivering a Linked Data warehouse and realising the power of graphs
 
Splice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflowSplice Machine's use of Apache Spark and MLflow
Splice Machine's use of Apache Spark and MLflow
 
Ben Gardner | Delivering a Linked Data warehouse and integrating across the w...
Ben Gardner | Delivering a Linked Data warehouse and integrating across the w...Ben Gardner | Delivering a Linked Data warehouse and integrating across the w...
Ben Gardner | Delivering a Linked Data warehouse and integrating across the w...
 
Enterprise Data Warehousing Positioning
Enterprise Data Warehousing PositioningEnterprise Data Warehousing Positioning
Enterprise Data Warehousing Positioning
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine Learning
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft Azure
 
Simply Business' Data Platform
Simply Business' Data PlatformSimply Business' Data Platform
Simply Business' Data Platform
 
Google Cloud Machine Learning
 Google Cloud Machine Learning  Google Cloud Machine Learning
Google Cloud Machine Learning
 
From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...
From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...
From lots of reports (with some data Analysis) 
to Massive Data Analysis (Wit...
 
DevOps Spain 2019. Olivier Perard-Oracle
DevOps Spain 2019. Olivier Perard-OracleDevOps Spain 2019. Olivier Perard-Oracle
DevOps Spain 2019. Olivier Perard-Oracle
 
Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja...
Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja...Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja...
Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja...
 
How Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science StackHow Data Virtualization Adds Value to Your Data Science Stack
How Data Virtualization Adds Value to Your Data Science Stack
 
Microsoft Fabric Introduction
Microsoft Fabric IntroductionMicrosoft Fabric Introduction
Microsoft Fabric Introduction
 
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
Big Data Integration Webinar: Reducing Implementation Efforts of Hadoop, NoSQ...
 
Introduction to Machine Learning - WeCloudData
Introduction to Machine Learning - WeCloudDataIntroduction to Machine Learning - WeCloudData
Introduction to Machine Learning - WeCloudData
 
Introduction to Machine Learning - WeCloudData
Introduction to Machine Learning - WeCloudDataIntroduction to Machine Learning - WeCloudData
Introduction to Machine Learning - WeCloudData
 
Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...
Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...
Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...
 

More from datamantra

State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streaming
datamantra
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetes
datamantra
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
datamantra
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
datamantra
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
datamantra
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle management
datamantra
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
datamantra
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
datamantra
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scala
datamantra
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scala
datamantra
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetes
datamantra
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
datamantra
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scala
datamantra
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scale
datamantra
 
Platform for Data Scientists
Platform for Data ScientistsPlatform for Data Scientists
Platform for Data Scientists
datamantra
 
Building scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTPBuilding scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTP
datamantra
 
Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2
datamantra
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalyst
datamantra
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
datamantra
 

More from datamantra (19)

State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streaming
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetes
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Execution
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloads
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streaming
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle management
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark ML
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streaming
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scala
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scala
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetes
 
Introduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actorsIntroduction to concurrent programming with akka actors
Introduction to concurrent programming with akka actors
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scala
 
Telco analytics at scale
Telco analytics at scaleTelco analytics at scale
Telco analytics at scale
 
Platform for Data Scientists
Platform for Data ScientistsPlatform for Data Scientists
Platform for Data Scientists
 
Building scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTPBuilding scalable rest service using Akka HTTP
Building scalable rest service using Akka HTTP
 
Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2Anatomy of Spark SQL Catalyst - Part 2
Anatomy of Spark SQL Catalyst - Part 2
 
Anatomy of spark catalyst
Anatomy of spark catalystAnatomy of spark catalyst
Anatomy of spark catalyst
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
 



Multi Source Data Analysis using Spark and Tellius

  • 1. Multi Source Data Analysis Using Apache Spark and Tellius https://github.com/phatak-dev/spark2.0-examples
  • 2. ● Madhukara Phatak ● Director of Engineering, Tellius ● Works on Hadoop, Spark, ML and Scala ● www.madhukaraphatak.com
  • 3. Agenda ● Multi Source Data ● Challenges with Multi Source ● Traditional and Data Lake Approach ● Spark Approach ● Data Source and Data Frame API ● Tellius Platform ● Multi Source analysis in Tellius
  • 5. Multi Source Data ● In the era of cloud computing and big data, data for analysis can come from various sources ● It has become very common for an organization to use several different storage systems to hold a wide variety of data ● The nature of the data varies from source to source ● Data can be structured, semi-structured or fully unstructured.
  • 6. Multi Source Example in Ecommerce ● Relational databases are used to hold product details and customer transactions ● Big data warehousing tools like Hadoop/Hive/Impala are used to store historical transactions and ratings for analytics ● Google Analytics stores the website analytics data ● Log data lives in S3 / Azure Blob ● Every storage system is optimized to store a specific type of data
  • 7. Multi Source Data Analysis
  • 8. Need of Multi Source Analysis ● If the analysis is restricted to only one source, we may lose sight of interesting patterns in our business ● A complete, 360 degree view of the business is not possible unless we consider all the data available to us ● Advanced analytics like ML or AI are more useful when there is more variety in the data
  • 9. Traditional Approach ● In the traditional way of doing multi source analysis, all data needed to be moved to a single data source ● This approach made sense when the number of sources was small and the data was well structured ● With an increasing number of sources, ETL takes longer ● Normalizing the data to the same schema becomes challenging for semi-structured sources ● Traditional databases also cannot hold the data at this volume
  • 10. Data Lake Approach ● Move the data from different sources to a big-data-enabled repository ● This solves the problem of volume, but challenges remain ● The rich schema information in the source may not translate well to the data lake repository ● ETL time will still be significant ● The processing capabilities of the underlying sources cannot be used ● Not good for exploratory analysis
  • 12. Requirements ● Ability to load data uniformly from different sources irrespective of their type ● Ability to represent the data in a single format irrespective of its source ● Ability to combine data from the sources naturally ● Ability to query the data across the sources naturally ● Ability to use the underlying source's processing whenever possible
  • 13. Apache Spark Approach ● The Data Source API of Spark SQL allows the user to load data uniformly from a wide variety of sources ● The DataFrame/Dataset API of Spark allows the user to represent data from all sources uniformly ● Spark SQL can join data from different sources ● Spark SQL pushes down filters and prunes columns if the underlying source supports it
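The filter pushdown and column pruning mentioned on this slide can be observed on a small scale. The sketch below is not taken from the talk's repository: it assumes a Parquet source and hypothetical column names (`customerid`, `age`, `city`), writes a tiny dataset to a temporary directory, and prints the query plan so the scan node can be inspected for pushed filters and a reduced read schema.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object PushdownSketch {
  // Filter and projection are expressed on the DataFrame; Spark decides
  // whether the source (Parquet here) can apply them while scanning.
  // Column names are hypothetical.
  def olderThanThirty(spark: SparkSession, parquetPath: String): DataFrame =
    spark.read.parquet(parquetPath)
      .filter(col("age") > 30)
      .select("customerid", "city")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("PushdownSketch").getOrCreate()
    import spark.implicits._

    // Write a tiny Parquet dataset so the sketch runs without external data.
    val path = java.nio.file.Files.createTempDirectory("customers")
      .resolve("data").toString
    Seq((1, 25, "Pune"), (2, 41, "Bangalore"), (3, 35, "Pune"))
      .toDF("customerid", "age", "city")
      .write.parquet(path)

    val result = olderThanThirty(spark, path)
    result.explain() // inspect the scan node of the physical plan
    result.show()
    spark.stop()
  }
}
```

Whether a given filter is actually pushed down depends on the source and the Spark version, so the plan output is the place to check rather than something to assume.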
  • 15. Customer 360 ● Four different datasets from two different sources ● We will be using flat file and MySQL data sources ● Transactions - Cost of product, purchase date, store id, store type, brands, retail department, retail cost (MySQL) ● Demographics - Primarily focuses on customer information like age, gender, location etc. (MySQL) ● Credit Information - Reward member, redemption method ● Marketing Information - Ad source, promotional code
  • 16. Loading Data ● We are going to use the CSV and JDBC connectors for Spark to load the data ● Because the schema is inferred automatically, the data frames carry all the schema we need ● After that we preview the data using the show method ● Ex : MultiSourceLoad
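A minimal sketch of the loading step, written independently of the `MultiSourceLoad` example in the repository. The file path, JDBC URL, table name and credentials are placeholders passed in as arguments, and a MySQL JDBC driver is assumed to be on the classpath.

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SparkSession}

object MultiSourceLoadSketch {
  // CSV: the header row supplies column names, inferSchema samples the
  // file to guess column types.
  def loadCsv(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(path)

  // JDBC: the schema comes from the table metadata of the database.
  def loadJdbc(spark: SparkSession, url: String, table: String,
               user: String, password: String): DataFrame = {
    val props = new Properties()
    props.setProperty("user", user)
    props.setProperty("password", password)
    spark.read.jdbc(url, table, props)
  }

  def main(args: Array[String]): Unit = {
    // Expected arguments (all placeholders):
    //   csvPath jdbcUrl table user password
    val Array(csvPath, jdbcUrl, table, user, password) = args
    val spark = SparkSession.builder()
      .master("local[*]").appName("MultiSourceLoadSketch").getOrCreate()

    val fromFile = loadCsv(spark, csvPath)
    val fromDb = loadJdbc(spark, jdbcUrl, table, user, password)

    // Both sources now look the same to the rest of the program.
    fromFile.printSchema()
    fromFile.show(5)
    fromDb.printSchema()
    fromDb.show(5)
    spark.stop()
  }
}
```

Both functions return a plain `DataFrame`, which is the point of the slide: downstream code does not need to know which source a dataset came from.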
  • 17. Multi Source Data Model ● We can define a data model using Spark joins ● Here we join the 4 datasets on customerid, the column they have in common ● After an inner join, we get a data model that combines all the sources ● Ex : MultiSourceDataModel
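The join described on this slide might look like the sketch below. It is not the repository's `MultiSourceDataModel` code: the four small in-memory DataFrames and every column other than `customerid` are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object MultiSourceDataModelSketch {
  // Inner-joins the four datasets on their shared customerid column.
  // Joining with Seq("customerid") keeps a single customerid column.
  def buildModel(transactions: DataFrame, demographics: DataFrame,
                 credit: DataFrame, marketing: DataFrame): DataFrame =
    transactions
      .join(demographics, Seq("customerid"), "inner")
      .join(credit, Seq("customerid"), "inner")
      .join(marketing, Seq("customerid"), "inner")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("MultiSourceDataModelSketch").getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows standing in for the real sources.
    val transactions = Seq((1, 120.0), (2, 80.0)).toDF("customerid", "revenue")
    val demographics = Seq((1, "Pune"), (2, "Bangalore")).toDF("customerid", "city")
    val credit = Seq((1, true), (2, false)).toDF("customerid", "rewardmember")
    val marketing = Seq((1, "InstagramAds"), (2, "Email")).toDF("customerid", "adsource")

    buildModel(transactions, demographics, credit, marketing).show()
    spark.stop()
  }
}
```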
  • 18. Multi Source Analysis ● Show us the sales by different sources ● Average Cost and Sum Revenue by City and Department ● Revenue by Campaign ● Ex : MultiSourceDataAnalysis
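The queries listed on this slide map onto ordinary DataFrame aggregations. The following is a sketch rather than the repository's `MultiSourceDataAnalysis` code; the column names (`city`, `department`, `adsource`, `cost`, `revenue`) and the sample rows are assumptions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, sum}

object MultiSourceAnalysisSketch {
  // Average cost and total revenue for each (city, department) pair.
  def costAndRevenueByCityAndDepartment(model: DataFrame): DataFrame =
    model.groupBy("city", "department")
      .agg(avg("cost").as("avg_cost"), sum("revenue").as("total_revenue"))

  // Total revenue per marketing source.
  def revenueBySource(model: DataFrame): DataFrame =
    model.groupBy("adsource").agg(sum("revenue").as("total_revenue"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("MultiSourceAnalysisSketch").getOrCreate()
    import spark.implicits._

    // Hypothetical rows of an already joined data model.
    val model = Seq(
      ("Pune", "Grocery", "InstagramAds", 10.0, 15.0),
      ("Pune", "Grocery", "Email", 20.0, 25.0),
      ("Bangalore", "Apparel", "Email", 30.0, 45.0)
    ).toDF("city", "department", "adsource", "cost", "revenue")

    costAndRevenueByCityAndDepartment(model).show()
    revenueBySource(model).show()
    spark.stop()
  }
}
```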
  • 20. About Tellius Search and AI-powered analytics platform, enabling anyone to get answers from their business data using an intuitive search-driven interface and automatically uncover hidden insights with machine learning
  • 21. SMART INTUITIVE PERSONALIZED Customers expect ON-DEMAND , Personalized experience We live in the era of intelligent consumer apps
  • 22. Takes days/weeks to get answers to ad-hoc questions Time consuming manual process of analyzing millions of combinations and charts No easy way for business users and analysts to understand, trust and leverage ML/AI techniques Low Analytics adoption Analysis process not scalable Trust with AI for business outcomes So much business data, but very few insights
  • 23. Tellius is disrupting data analytics with AI Combining modern search driven user experience with AI-driven automation to find hidden answers
  • 24. Tellius Modern Analytics experience Get Instant answers Start exploring Reduce your analysis time from Hours to Mins Explainable AI for business analysts Time consuming, Canned reports and dashboards On-Demand, Personalized experience Self-service data prep Scalable In-Memory Data Platform Search-driven Conversational Analytics Automated discovery Of insights Automated Machine Learning
  • 25. Only AI Platform that enables collaboration between roles DATA MANAGEMENT Visual Data prep with SQL/ Python support VISUAL ANALYSIS Voice Enabled Search Driven Interface for Asking Questions Business User Data Science Practitioner Data Analyst Data Engineer DISCOVERY OF INSIGHTS Augmented discovery of insights With natural language narrative MACHINE LEARNING AutoML and deployment of ML models with Explainable AI
  • 26. Google-like Search driven Conversational interface Reveals hidden relevant insights saving 1000’s of hours Eliminating friction between self service data prep to ad-hoc analysis and explainable ML models In-memory architecture capable of handling billions of records Intuitive UX AI-Driven Automation Unified Analytics Experience Scalable Architecture Why Tellius? Only company providing instant Natural language Search experience, surfacing AI-driven relevant insights across billions of records across data sources at scale and enabling users to easily create and explain ML/AI models
  • 27. Business Value Proposition Automate discovery of relevant hidden Insights in your data Ease of Use Uncover Hidden Insights Get instant answers with conversational Search driven approach Save Time Augment Manual discovery process with automation powered by Machine learning
  • 28. Our Vision- Accelerate journey to AI driven Enterprise CONNECT EXPLORE DISCOVER PREDICT
  • 29. Customer 360 on Tellius
  • 30. Loading Data ● Tellius exposes various kinds of data sources to connect to using the Spark Data Source API ● In this use case, we will be using the MySQL and CSV connectors to load the data into the system ● Tellius collects metadata about the data as part of loading ● Some of the connectors, like Salesforce and Google Analytics, are homegrown using the same Data Source API
  • 31. Defining Data Model ● Tellius calls data models business views ● A business view allows the user to create a data model across datasets seamlessly ● Internally, all datasets in Tellius are represented as Spark DataFrames ● Defining a business view in Tellius is like defining a join in Spark SQL
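To make the comparison with a Spark SQL join concrete, here is a sketch using plain Spark temporary views. It shows only the Spark side of the analogy and says nothing about how Tellius implements business views; the view names, column names and sample rows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object BusinessViewAnalogySketch {
  // Registers two datasets as views, defines a joined view over them,
  // then queries the joined view as if it were a single table.
  def revenueByCity(spark: SparkSession, transactions: DataFrame,
                    demographics: DataFrame): DataFrame = {
    transactions.createOrReplaceTempView("transactions")
    demographics.createOrReplaceTempView("demographics")

    spark.sql(
      """SELECT t.customerid, d.city, t.revenue
        |FROM transactions t
        |JOIN demographics d ON t.customerid = d.customerid""".stripMargin
    ).createOrReplaceTempView("customer360")

    spark.sql(
      "SELECT city, SUM(revenue) AS revenue FROM customer360 GROUP BY city")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("BusinessViewAnalogySketch").getOrCreate()
    import spark.implicits._

    val transactions = Seq((1, 120.0), (2, 80.0), (3, 40.0)).toDF("customerid", "revenue")
    val demographics = Seq((1, "Pune"), (2, "Bangalore"), (3, "Pune")).toDF("customerid", "city")

    revenueByCity(spark, transactions, demographics).show()
    spark.stop()
  }
}
```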
  • 32. Multi Source analysis using NLP ● Which top 6 sources by avg revenue ● Hey Tellius what’s my revenue broken down by department ● show revenue by city ● show revenue by department for InstagramAds ● These ultimately run as Spark queries and produce the results ● We can also use voice
  • 33. Multi Source analysis using Assistant ● Show total revenue ● By city ● What about cost ● for InstagramAds ● Use Voice ● Try out Google Home
  • 35. Spark DataModel ● A Spark join creates a flat data model, which differs from a typical data warehouse data model ● This flat data model works well when there is no duplication of primary keys, as in a star model ● But if there is duplication, we end up double counting values when we run queries directly ● Example : DoubleCounting
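The double counting problem is easy to reproduce. In this sketch (made-up data, not the repository's `DoubleCounting` example), customer 1 has one transaction but two marketing rows, so the join repeats that transaction's revenue and a sum over the joined data is inflated.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object DoubleCountingSketch {
  // Returns (true total revenue, total revenue summed over the join).
  def totals(spark: SparkSession): (Double, Double) = {
    import spark.implicits._

    // One revenue row per customer.
    val transactions = Seq((1, 100.0), (2, 50.0)).toDF("customerid", "revenue")
    // Customer 1 appears twice on the marketing side.
    val marketing = Seq((1, "InstagramAds"), (1, "Email"), (2, "Email"))
      .toDF("customerid", "adsource")

    val trueTotal = transactions.agg(sum("revenue")).first().getDouble(0)

    // The join emits customer 1's revenue once per matching marketing row.
    val joined = transactions.join(marketing, Seq("customerid"), "inner")
    val joinedTotal = joined.agg(sum("revenue")).first().getDouble(0)

    (trueTotal, joinedTotal)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("DoubleCountingSketch").getOrCreate()
    val (trueTotal, joinedTotal) = totals(spark)
    println(s"true total = $trueTotal, total over join = $joinedTotal")
    // prints: true total = 150.0, total over join = 250.0
    spark.stop()
  }
}
```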
  • 36. Handling Double Counting in Tellius ● Tellius has implemented its own query language on top of the Spark SQL layer to implement data-warehouse-like strategies that avoid this double counting ● This layer allows Tellius to provide multi source analysis on top of Spark with the accuracy of a data warehouse system ● Ex : show point_redemeption_method
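The slides do not describe how the Tellius query layer works internally, so the sketch below is not its implementation. It only illustrates one generic way to avoid the inflated sum in plain Spark: keep one row per key of the table that owns the measure before aggregating. The data and column names are made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object DedupBeforeAggregateSketch {
  // Sums revenue over the joined model while counting each customer's
  // revenue row only once.
  def correctedTotal(spark: SparkSession): Double = {
    import spark.implicits._

    val transactions = Seq((1, 100.0), (2, 50.0)).toDF("customerid", "revenue")
    val marketing = Seq((1, "InstagramAds"), (1, "Email"), (2, "Email"))
      .toDF("customerid", "adsource")
    val joined = transactions.join(marketing, Seq("customerid"), "inner")

    // revenue belongs to the transactions side, which has one row per
    // customerid, so one joined row per customerid carries it exactly once.
    joined.dropDuplicates(Seq("customerid"))
      .agg(sum("revenue"))
      .first()
      .getDouble(0)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("DedupBeforeAggregateSketch").getOrCreate()
    println(s"corrected total = ${correctedTotal(spark)}")
    // prints: corrected total = 150.0
    spark.stop()
  }
}
```

This works here because revenue is constant within a `customerid`; a measure that varies within the key would need to be aggregated on its own table before the join instead.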
  • 37. References ● Dataset API - https://www.youtube.com/watch?v=hHFuKeeQujc ● Structured Data Analysis - https://www.youtube.com/watch?v=0jd3EWmKQfo ● Anatomy of Spark SQL - https://www.youtube.com/watch?v=TCWOJ6EJprY