Multi Source Data Analysis using Spark and Tellius

  1. Multi Source Data Analysis Using Apache Spark and Tellius https://github.com/phatak-dev/spark2.0-examples
  2. ● Madhukara Phatak ● Director of Engineering, Tellius ● Works on Hadoop, Spark, ML and Scala ● www.madhukaraphatak.com
  3. Agenda ● Multi Source Data ● Challenges with Multi Source ● Traditional and Data Lake Approach ● Spark Approach ● Data Source and DataFrame API ● Tellius Platform ● Multi Source Analysis in Tellius
  4. Multi Source Data
  5. Multi Source Data ● In the era of cloud computing and big data, data for analysis can come from various sources ● In every organization, it has become common to have a wide variety of storage systems, each holding different kinds of data ● The nature of the data varies from source to source ● Data can be structured, semi-structured or fully unstructured
  6. Multi Source Example in Ecommerce ● Relational databases hold product details and customer transactions ● Big data warehousing tools like Hadoop/Hive/Impala store historical transactions and ratings for analytics ● Google Analytics holds the website analytics data ● Log data lives in S3 / Azure Blob storage ● Every storage system is optimized for a specific type of data
  7. Multi Source Data Analysis
  8. Need for Multi Source Analysis ● If the analysis is restricted to only one source, we may lose sight of interesting patterns in our business ● A complete, 360 degree view of the business is not possible unless we consider all the data available to us ● Advanced analytics like ML or AI is more useful when there is more variety in the data
  9. Traditional Approach ● The traditional way of doing multi source analysis required all data to be moved into a single data store ● This approach made sense when the number of sources was small and the data was well structured ● With an increasing number of sources, ETL time grows ● Normalizing the data into a common schema becomes challenging for semi-structured sources ● Traditional databases also cannot hold data at this volume
  10. Data Lake Approach ● Move the data from the different sources into a big data enabled repository ● This solves the volume problem, but challenges remain ● The rich schema information in the sources may not translate well to the data lake repository ● ETL time is still significant ● The processing capabilities of the underlying sources can no longer be used ● Not good for exploratory analysis
  11. Apache Spark Approach
  12. Requirements ● Ability to load the data uniformly from different sources irrespective of their type ● Ability to represent the data in a single format irrespective of its source ● Ability to combine the data from the sources naturally ● Ability to query the data across the sources naturally ● Ability to use the underlying source's processing whenever possible
  13. Apache Spark Approach ● The Data Source API of Spark SQL allows users to load data uniformly from a wide variety of sources ● The DataFrame / Dataset API of Spark allows users to represent data from all sources uniformly ● Spark SQL can join data from different sources ● Spark SQL pushes down filters and prunes columns if the underlying source supports it (see the sketch below)
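
A minimal sketch of that pushdown behaviour; the MySQL connection URL, credentials, table and column names are placeholders (not from the deck or repo), and the MySQL JDBC driver is assumed to be on the classpath:

    import org.apache.spark.sql.SparkSession

    object PushdownSketch extends App {
      val spark = SparkSession.builder()
        .appName("pushdown-sketch")
        .master("local[*]")
        .getOrCreate()

      // JDBC is one of many built-in data sources behind the same read API
      val transactions = spark.read.format("jdbc")
        .option("url", "jdbc:mysql://localhost:3306/retail")   // placeholder connection
        .option("dbtable", "transactions")                      // placeholder table
        .option("user", "user")
        .option("password", "password")
        .load()

      // Column pruning and filter pushdown: only the selected columns and the
      // matching rows are requested from MySQL; explain() shows the pushed
      // filters in the physical plan
      transactions.select("customerid", "revenue")
        .filter("revenue > 100")
        .explain()
    }
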
  14. Customer 360 Use Case
  15. Customer 360 ● Four different datasets from two different sources ● We will be using flat file and MySQL data sources ● Transactions - cost of product, purchase date, store id, store type, brands, retail department, retail cost (MySQL) ● Demographics - customer information like age, gender, location etc. (MySQL) ● Credit Information - reward member, redemption method ● Marketing Information - ad source, promotional code
  16. Loading Data ● We are going to use the csv and jdbc connectors for Spark to load the data ● Thanks to automatic schema inference, the DataFrame picks up all the needed schema ● After that we preview the data using the show method (a sketch follows below) ● Ex : MultiSourceLoad
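
The actual MultiSourceLoad example lives in the linked repo; the sketch below only shows the general shape, assuming the SparkSession `spark` from the earlier snippet and placeholder paths, connection details and table names:

    // assuming `spark` is the SparkSession from the earlier sketch

    // CSV with header and schema inference: column names and types come from the file
    val marketing = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/marketing.csv")                              // placeholder path

    // JDBC source: the schema comes straight from the MySQL table metadata
    val demographics = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/retail")    // placeholder connection
      .option("dbtable", "demographics")                      // placeholder table
      .option("user", "user")
      .option("password", "password")
      .load()

    marketing.printSchema()   // inspect the inferred schema
    demographics.show(5)      // preview the first rows
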
  17. Multi Source Data Model ● We can define a data model using Spark's join ● Here we join the 4 datasets on customerid as the common key ● After an inner join, we get a data model that combines all the sources (see the sketch below) ● Ex : MultiSourceDataModel
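
A sketch of the join behind MultiSourceDataModel; the DataFrame names are illustrative and assume the four datasets were loaded as above, each carrying a customerid column:

    // assuming transactions, demographics, credit and marketing DataFrames
    // are already loaded and each carries a customerid column
    val dataModel = transactions
      .join(demographics, Seq("customerid"))   // inner join is the default
      .join(credit, Seq("customerid"))
      .join(marketing, Seq("customerid"))

    dataModel.printSchema()   // one flat schema combining all four sources
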
  18. Multi Source Analysis ● Sales by different sources ● Average cost and total revenue by city and department ● Revenue by campaign (sketches below) ● Ex : MultiSourceDataAnalysis
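
The three questions above as DataFrame aggregations, assuming illustrative column names (revenue, cost, city, department, ad_source, promotional_code) that may differ from the actual datasets in the repo:

    import org.apache.spark.sql.functions.{avg, sum}

    // Sales by ad source
    dataModel.groupBy("ad_source")
      .agg(sum("revenue").as("total_revenue"))
      .show()

    // Average cost and total revenue by city and department
    dataModel.groupBy("city", "department")
      .agg(avg("cost").as("avg_cost"), sum("revenue").as("total_revenue"))
      .show()

    // Revenue by campaign (promotional code)
    dataModel.groupBy("promotional_code")
      .agg(sum("revenue").as("revenue"))
      .show()
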
  19. Introduction to Tellius
  20. About Tellius ● Search and AI-powered analytics platform, enabling anyone to get answers from their business data using an intuitive search-driven interface and automatically uncover hidden insights with machine learning
  21. We live in the era of intelligent consumer apps ● Customers expect an on-demand, personalized experience ● SMART, INTUITIVE, PERSONALIZED
  22. So much business data, but very few insights ● Low analytics adoption - takes days/weeks to get answers to ad-hoc questions ● Analysis process not scalable - time consuming manual process of analyzing millions of combinations and charts ● Trust with AI for business outcomes - no easy way for business users and analysts to understand, trust and leverage ML/AI techniques
  23. Tellius is disrupting data analytics with AI: combining a modern search driven user experience with AI-driven automation to find hidden answers
  24. Tellius Modern Analytics Experience ● From time consuming canned reports and dashboards to an on-demand, personalized experience ● Get instant answers, start exploring ● Reduce your analysis time from hours to minutes ● Explainable AI for business analysts ● Self-service data prep ● Scalable in-memory data platform ● Search-driven conversational analytics ● Automated discovery of insights ● Automated machine learning
  25. The only AI platform that enables collaboration between roles (business user, data analyst, data engineer, data science practitioner) ● Data Management - visual data prep with SQL/Python support ● Visual Analysis - voice enabled, search driven interface for asking questions ● Discovery of Insights - augmented discovery of insights with natural language narrative ● Machine Learning - AutoML and deployment of ML models with explainable AI
  26. Why Tellius? ● Intuitive UX - Google-like search driven conversational interface ● AI-Driven Automation - reveals hidden, relevant insights, saving thousands of hours ● Unified Analytics Experience - eliminates friction between self service data prep, ad-hoc analysis and explainable ML models ● Scalable Architecture - in-memory architecture capable of handling billions of records ● The only company providing an instant natural language search experience, surfacing AI-driven relevant insights across billions of records across data sources at scale and enabling users to easily create and explain ML/AI models
  27. Business Value Proposition ● Ease of Use - get instant answers with a conversational, search driven approach ● Uncover Hidden Insights - automate discovery of relevant hidden insights in your data ● Save Time - augment the manual discovery process with automation powered by machine learning
  28. Our Vision - accelerate the journey to an AI driven enterprise: CONNECT, EXPLORE, DISCOVER, PREDICT
  29. Customer 360 on Tellius
  30. Loading Data ● Tellius exposes various kinds of data sources to connect to, using the Spark data source API ● In this use case, we will be using the MySQL and csv connectors to load the data into the system ● Tellius collects metadata about the data as part of loading ● Some of the connectors, like Salesforce and Google Analytics, are homegrown using the same data source API
  31. Defining Data Model ● Tellius calls data models business views ● Business views allow users to create data models across datasets seamlessly ● Internally, all datasets in Tellius are represented as Spark DataFrames ● Defining a business view in Tellius is like defining a join in Spark SQL
  32. Multi Source Analysis using NLP ● Which top 6 sources by avg revenue ● Hey Tellius, what's my revenue broken down by department ● show revenue by city ● show revenue by department for InstagramAds ● These ultimately run as Spark queries and produce the results (an illustrative translation follows below) ● We can also use voice
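
Tellius's query translation is internal to the platform; purely as an illustration, a question like "show revenue by department for InstagramAds" would resolve to a Spark query roughly like the following, with assumed column and value names:

    import org.apache.spark.sql.functions.{col, sum}

    // "show revenue by department for InstagramAds"
    dataModel
      .filter(col("ad_source") === "InstagramAds")
      .groupBy("department")
      .agg(sum("revenue").as("revenue"))
      .show()
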
  33. Multi Source Analysis using Assistant ● Show total revenue ● By city ● What about cost ● for InstagramAds ● Use voice ● Try out Google Home
  34. Challenges
  35. Spark Data Model ● A Spark join creates a flat data model, which is different from a typical data warehouse data model ● This flat data model is fine when there is no duplication of primary keys, i.e. a star model ● But if there is duplication, we end up double counting values when we run queries directly (see the sketch below) ● Example : DoubleCounting
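
A small illustration of the problem on made-up toy data (the DoubleCounting example in the repo shows it on the actual datasets):

    // assuming `spark` is in scope
    import spark.implicits._
    import org.apache.spark.sql.functions.sum

    // one rewards row per customer...
    val rewards = Seq(("c1", 100), ("c2", 50)).toDF("customerid", "reward_points")
    // ...but many purchases per customer
    val purchases = Seq(("c1", 10.0), ("c1", 20.0), ("c2", 5.0)).toDF("customerid", "amount")

    val flat = purchases.join(rewards, Seq("customerid"))

    // reward_points for c1 appears on every c1 purchase row, so a direct
    // sum reports 250 instead of the true 150
    flat.agg(sum("reward_points")).show()
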
  36. Handling Double Counting in Tellius ● Tellius has implemented its own query language on top of the Spark SQL layer to apply data warehouse like strategies that avoid this double counting (one such strategy is sketched below) ● This layer allows Tellius to provide multi source analysis on top of Spark with the accuracy of a data warehouse system ● Ex : show point_redemeption_method
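
Tellius's query layer itself is not public; a common warehouse-style strategy of this kind is to aggregate each measure at its own grain before joining. Continuing the toy data from the previous sketch:

    // Roll purchases up to one row per customer first, so the join with
    // rewards is one-to-one and reward_points sums to the correct 150
    val purchaseTotals = purchases.groupBy("customerid")
      .agg(sum("amount").as("total_amount"))

    purchaseTotals.join(rewards, Seq("customerid"))
      .agg(sum("reward_points").as("reward_points"), sum("total_amount").as("total_amount"))
      .show()
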
  37. References ● Dataset API - https://www.youtube.com/watch?v=hHFuKeeQujc ● Structured Data Analysis - https://www.youtube.com/watch?v=0jd3EWmKQfo ● Anatomy of Spark SQL - https://www.youtube.com/watch?v=TCWOJ6EJprY
  38. We are Hiring!!!
  39. Thank You
