Data Profiling and Pipeline Processing with Spark – A Journey
Suren Nathan
Synchronoss
(Q3’2014 revenue)
Who am I
• Sr. Director Big Data Platform and Analytics Framework
at Synchronoss
• CTO at Razorsight (acquired by Synchronoss)
• Worked in Analytics and decision support systems for
more than 15 years
• Passionate about solving business problems by leveraging
the latest technology
Synchronoss provides Personal Cloud and
Activation Platforms to Tier One Operators,
MSOs and Enterprises around the globe
Mobile Content Transfer
Personal Cloud
Device Activation
Cloud Account Provisioning
On-Boarded
Welcome
Synchronoss Integrated Cloud Products
Online and Device Activation
Back-up, Sync and Share
ACTIVATION | CLOUD | Internet of Things
Integrated Life
Synchronoss Connects Operators to
their Customers
Big Data @ Synchronoss
Sample numbers @ one Tier One customer:
• 30M registered users
• 14M monthly active users
• 8M daily active users
• Up to 215TB of ingest per day
• 62PB of content stored
• 50 Billion user content files
• Ingest of 1PB per week
• 4+ Star Rating Apps
What do we do?
• Big Data Analytics Platform Group
• Implement scalable big data technology platform
to help deliver consistent analytics
• Platform deployed in private cloud and AWS
Data Pipeline Process
Ingest Data -> Profile Data -> Parse Data -> Transform Data -> Enrich Data -> Aggregate Data
- Perform Analysis
- Load Index Store
- Feed EDW
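The stages above can be read as functions chained over a batch of records. The sketch below is only an illustration in plain Python, not the Synchronoss framework: the pipe-delimited record layout, the field names, and every function name are assumptions, and the ingest and profile stages are left out.

```python
from collections import defaultdict

def parse(lines, delimiter="|"):
    """Parse: split each raw delimited line into a list of fields."""
    return [line.rstrip("\n").split(delimiter) for line in lines]

def transform(rows):
    """Transform: name the fields and cast the byte count to int."""
    return [{"user": user, "region": region, "bytes": int(size)}
            for user, region, size in rows]

def enrich(records, region_names):
    """Enrich: attach a readable region name from a lookup table."""
    return [{**rec, "region_name": region_names.get(rec["region"], "unknown")}
            for rec in records]

def aggregate(records):
    """Aggregate: total bytes per region name."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec["region_name"]] += rec["bytes"]
    return dict(totals)

def run_pipeline(lines, region_names):
    """Chain the stages in pipeline order (hypothetical helper)."""
    return aggregate(enrich(transform(parse(lines)), region_names))
```

The aggregated result is what would then feed analysis, an index store, or the EDW.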
Our Data Pipeline Journey
Data Pipeline – V1
[Diagram: Source Data, Staging, ETL, EDW; process-centric ETL]
• Multiple custom ETLs separated from the data layer
• SMP architecture, not distributed
• Long-running batch workloads
• Contention and bottlenecks with increased data volume
• No support for unstructured data
• Cannot retain historical data
$$$ | >1 year | Inflexible
Data Pipeline – V2
[Diagram: Source Data, Staging, ETL, EDW; process-centric ETL on an MPP Appliance]
• ETLs closer to data
• High performance, but expensive
• Batch workloads, with reduced latencies
• Unable to handle unstructured data
• Storage costs prohibitive
$$$$ | 6 months+ | Still inflexible
Data Pipeline – V3 (Option Skipped)
[Diagram: Source Data]
• Did not foresee a huge improvement
• Batch workloads only
• Slow performance with MapReduce
• Lack of resources and skills gap
• Lack of consistency
• Too many tools
$$ | 1 year+ | Risks
Data Pipeline – V4
[Diagram: Source Data]
• ETLs closer to data
• Batch and stream workloads
• Superior performance
• Abstracted features via Framework
• Components and standards
• Multiple language support
• Simplified design
$ | <1 month | Highly flexible
Data Profiling
"Put all the data in the lake, man."
"What’s in these data sets?"
"More data is better. Work with the population and not a sample."
-- Data Scientists
Why Data Profiling?
• Find out what is in the data
• Get metrics on data quality
• Assess the risk involved in creating business rules
• Discover metadata including value patterns and
distributions
• Understand data challenges early to avoid
delays and cost overruns
• Improve the ability to search the data
Business Challenge
• Analysts spend 80-90% of their time on data munging
• Current approaches require multiple manual touch points and processes
• Lost opportunity due to lengthy project time frames
Typical Scenario
• Data size too large to view using Excel or Notepad
• Data has to be loaded into a database for profiling
• Cannot load into a database unless the data fields are known
• File formats are not right and specifications are incorrect
• Distribution, space, multiple touch points, moving files here and there
• Too many dependencies, wasted time
What do we need?
Speed, Agility & Automation
Data Profiler Requirements
Profile data from data lake
Validate and Preview data
Review statistics
Create Meta Data
Create Downstream Schema
Spark to the rescue
• Check the types
• Check the values
• Calculate metrics
• Generate metadata
[Diagram labels: Data Files, RDD, RDDs, columns C1 C2 C3 C4 … Cn]
• Dynamically built execution graph
• Map -> Map
• Built-in transformations (unique, get first, etc.)
• In-memory execution provides speed
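The four checks listed above (types, values, metrics, metadata) can be sketched per column. This is a minimal plain-Python illustration, not the Spark profiler itself: the function names, the three-way type scheme (int, float, string), and the rule that an empty field counts as missing are all assumptions. In Spark the same per-column logic would run as map steps over RDDs.

```python
def infer_type(value):
    """Return 'int', 'float', or 'string' for one raw text value."""
    for name, cast in (("int", int), ("float", float)):
        try:
            cast(value)
            return name
        except ValueError:
            continue
    return "string"

def profile_column(values):
    """Check types and values for one column and return its metadata."""
    present = [v for v in values if v.strip() != ""]  # empty field = missing (assumption)
    types = {infer_type(v) for v in present}
    if not types:
        col_type = "unknown"
    elif types == {"int"}:
        col_type = "int"
    elif types <= {"int", "float"}:
        col_type = "float"   # mixed ints and floats widen to float
    else:
        col_type = "string"
    return {
        "type": col_type,
        "missing": len(values) - len(present),
        "non_missing": len(present),
        "distinct": len(set(present)),
    }

def profile_rows(rows, names):
    """Transpose parsed rows into columns (C1..Cn) and profile each one."""
    columns = zip(*rows)
    return {name: profile_column(list(col)) for name, col in zip(names, columns)}
```

For example, `profile_rows([["1", "2.5", "a"], ["2", "", "b"]], ["c1", "c2", "c3"])` reports `c1` as int, `c2` as float with one missing value, and `c3` as string.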
Execution Flow and Software Stack
[Diagram: execution flow with numbered steps 1-8 (including 6.1, 6.2, 6.3) across the Data Profiler UI, the Spark Data Profiler, the Data Lake location for data, and the Meta Data Repository]
Stack labels: WEB Server, Data Profiler UI, Spark, Spark Data Profiler, Spark Monitoring UI, MapR UI, MapR Cluster, MapR FS (M7), NFS, Meta Data Repository
Legend: Hardware Infrastructure Level, OS/File System Level, Razorsight Application Level, System Application Level
Univariate Statistics
Outputs for Numeric Values | Outputs for Non-Numeric Values
• Histograms
• Count of Missing Values
• Count of Non-Missing Values
• Mean
• Variance
• Standard Deviation
• Minimum
• Maximum
• Range
• Mode
• Median
• Q1 Value
• Q3 Value
• Interquartile Range
• Skewness
• Kurtosis
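Most of the numeric outputs above can be computed with Python's standard `statistics` module. The sketch below is an illustration, not the profiler's implementation, and rests on stated assumptions: `None` marks a missing value, variance is the population variance, quartiles use the "inclusive" method, and kurtosis is reported as excess kurtosis. Histograms are omitted.

```python
import statistics

def univariate(values):
    """Univariate statistics for one numeric column; None marks a missing value."""
    present = [float(v) for v in values if v is not None]
    n = len(present)
    mean = statistics.fmean(present)
    variance = statistics.pvariance(present)      # population variance (assumption)
    std = variance ** 0.5
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    # Third and fourth central moments for skewness and kurtosis.
    m3 = sum((x - mean) ** 3 for x in present) / n
    m4 = sum((x - mean) ** 4 for x in present) / n
    return {
        "missing": len(values) - n,
        "non_missing": n,
        "mean": mean,
        "variance": variance,
        "std_dev": std,
        "min": min(present),
        "max": max(present),
        "range": max(present) - min(present),
        "mode": statistics.mode(present),
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "skewness": m3 / std ** 3 if std else 0.0,
        "kurtosis": m4 / std ** 4 - 3 if std else 0.0,   # excess kurtosis
    }
```

Other definitions (sample variance, a different quartile method, non-excess kurtosis) are equally valid choices and would change the numbers slightly.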
Data Profiler Web Application
Meta Data and DDL
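One way profiled metadata could become DDL for a downstream schema is sketched below. The deck does not show the generated DDL or its SQL dialect, so the type mapping, the `VARCHAR(255)` fallback, and the function name `build_ddl` are all hypothetical.

```python
# Hypothetical mapping from profiled column types to SQL column types.
SQL_TYPES = {"int": "BIGINT", "float": "DOUBLE", "string": "VARCHAR(255)"}

def build_ddl(table, columns):
    """Build a CREATE TABLE statement.

    columns is a list of (name, profiled_type, nullable) tuples, for example
    taken from a profiling step that recorded a type and a missing-value count.
    """
    lines = []
    for name, col_type, nullable in columns:
        sql_type = SQL_TYPES.get(col_type, "VARCHAR(255)")  # fall back to text
        null_sql = "" if nullable else " NOT NULL"
        lines.append(f"  {name} {sql_type}{null_sql}")
    return f"CREATE TABLE {table} (\n" + ",\n".join(lines) + "\n);"
```

A column with no missing values in the profile can be declared `NOT NULL`, which is one place where profiling results feed directly into the schema.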
Advantages
• Source data in data lake
• All profiling done in the data lake
• No manual movement of data
• Profile sample or full data set
• Integrate creation of meta data for
transformation, enrichment
• Send clean data to downstream processes
Results
• Improved data analysis time from weeks to
hours
• Average improvement of data pipeline process:
80%
• Identified data quality issues well ahead of time
• Empowered business analysts to perform the
work
Framework Layers
Layer 1: Infrastructure
Layer 2: Data Management
Layer 3: Modeling
Layer 4: Integration
Layer 5: Business Insight and Actions
Components:
• Data Ingestion: Structured | Unstructured | Batch | Streaming (sources: SFTP, NDM, Nwk Path, Social Media, Stream, Email)
• Data Lake: Secure Repository
• Data Preparation: Data Health | Cleansing | Pruning | Transformation | Univariate Analysis
• Data Analytics: Descriptive | Predictive | Bivariate | Multivariate
• Data Services: RESTful | SOA
• Data Visualization: Dashboards | Ad hoc Queries | KPIs | Alerts
Framework Components
Ingestion
• Multiple source channels
• Batch/Real Time
• Data Validation
• Compression/Encryption
Profiling
• Data Health Check
• Summary Statistics
• Scrubbing/Cleansing
• Meta Data Creation
Parsing
• Fixed Width
• Delimited
• Mapping
Transformation
• Enrichment
• Truncation
• Imputation
• Aggregation
Integration
• Batch
• RESTful
• Database
Web Portal
• Meta Data Configuration
• Tracking
• Alerts
• Dashboard
Framework Architecture
[Diagram labels:]
• External Data Sources, Data Ingestion, Data Beacon
• Synchronoss Data Lake
• Data Storage Layer
• Processing Components: Data Partitioner, Data Parser & Transformer, Data Aggregator, Data Profiler, Data Reconciliation, Elastic Search Loader, DB Loader
• Bivariate Engine, Data Prep Engine, SQL Engine
• Orchestration Layer
• XDF Web UI, Elastic Search, MySQL Meta-data Repository
• Legend: Control Flow, Data Flow
Framework Technology Stack
Clusters: MapR Cluster, UI/Control Cluster, ElasticSearch Cluster
Components: MapR FS (M7), Sqoop, Apache Spark, Hadoop, NFS, Oozie, Apache Drill, Tomcat, ActiveMQ, Spring Integration, HUE, ElasticSearch Engine, Angular, REST
Operating system: Unix/Linux
Legend: Hardware Infrastructure Level, OS/File System Level, System Application Level
What’s Next?
• Bivariate Analysis
Outputs for Numeric Values (by target value for each variable): Record Count, Row Count Percent, Average, Variance, Standard Deviation, Skewness, Kurtosis, Minimum, Maximum
Correlation Outputs: Pearson’s Correlation Coefficient, Spearman’s Correlation Coefficient, Covariance
• Multicollinearity
Variable Clustering, Regression Coefficients, Dendrogram, Hierarchical Cluster (HCA), Correlation Matrix, Variance Inflation Factor (VIF)
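The correlation outputs named above have standard definitions, sketched here in plain Python. This is an illustration under assumptions, not the planned bivariate engine: covariance is the sample covariance, Spearman's coefficient is computed as Pearson's coefficient on average ranks, and the VIF shown is only the special case of two predictors, where it reduces to 1 / (1 - r²). All function names are hypothetical.

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def pearson(xs, ys):
    """Pearson's correlation coefficient."""
    return covariance(xs, ys) / (covariance(xs, xs) * covariance(ys, ys)) ** 0.5

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman's correlation: Pearson's coefficient on the ranks."""
    return pearson(ranks(xs), ranks(ys))

def vif_two_predictors(xs, ys):
    """Variance inflation factor for the two-predictor case only."""
    r = pearson(xs, ys)
    return 1 / (1 - r * r)
```

With more than two predictors, each VIF comes from regressing one predictor on all the others, which this sketch does not cover.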
Lessons
• Let business value drive technology adoption
• Plan incremental updates
• Pay attention to hidden costs
• Simplify
• Implement Framework based development
• Leverage existing skillset to scale
THANK YOU.
Suren.nathan@Synchronoss.com

ewymefz
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
John Andrews
 

Recently uploaded (20)

一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdfSample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
Sample_Global Non-invasive Prenatal Testing (NIPT) Market, 2019-2030.pdf
 
一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单一比一原版(YU毕业证)约克大学毕业证成绩单
一比一原版(YU毕业证)约克大学毕业证成绩单
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
一比一原版(CBU毕业证)不列颠海角大学毕业证成绩单
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
一比一原版(UVic毕业证)维多利亚大学毕业证成绩单
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 

Spark Summit Keynote by Suren Nathan

  • 1. Data Profiling and Pipeline Processing with Spark – A Journey Suren Nathan Synchronoss
  • 2. Who am I • Sr. Director Big Data Platform and Analytics Framework at Synchronoss • CTO at Razorsight (acquired by Synchronoss) • Worked in Analytics and decision support systems for more than 15 years • Passionate about solving business problems leveraging latest technology
  • 3. Synchronoss provides Personal Cloud and Activation Platforms to Tier One Operators, MSOs and Enterprises around the globe
  • 5. Online and Device ACTIVATION Back-up, Sync and Share ACTIVATION CLOUD Internet of Things Integrated Life Synchronoss Connects Operators to their Customers
  • 6. Big Data @ Synchronoss Sample numbers @ one tier1 customer: • 30M registered users • 14M monthly active users • 8M daily active users • Up to 215TB of ingest per day • 62PB of content stored • 50 Billion user content files • Ingest of 1PB per week • 4+ Star Rating Apps
  • 7. What do we do? • Big Data Analytics Platform Group • Implement scalable big data technology platform to help deliver consistent analytics • Platform deployed in private cloud and AWS
  • 8. Data Pipeline Process: Ingest Data -> Profile Data -> Parse Data -> Transform Data -> Enrich Data -> Aggregate Data -> Perform Analysis / Load Index Store / Feed EDW
  • 10. Data Pipeline – V1 (diagram labels: Source Data, Staging, ETL, EDW; Process Centric ETL) • Multiple custom ETLs separated from data layer • SMP architecture, not distributed • Long-running batch workloads • Contention and bottlenecks with increased data volume • No support for unstructured data • Cannot retain historical data • $$$ • >1 year • Inflexible
  • 11. Data Pipeline – V2 (diagram labels: Source Data, Staging, ETL, EDW; Process Centric ETL; MPP Appliance) • ETLs closer to data • High performance, but expensive • Batch workloads, with reduced latencies • Unable to handle unstructured data • Storage costs prohibitive • $$$$ • 6 months+ • Still inflexible
  • 12. Data Pipeline – V3 (option skipped; diagram label: Source Data) • Did not foresee a huge improvement • Batch workloads only • Slow performance with MapReduce • Lack of resources and skills gap • Lack of consistency • Too many tools • $$ • 1 year+ • Risks
  • 13. Data Pipeline – V4 (diagram label: Source Data) • ETLs closer to data • Batch and stream workloads • Superior performance • Abstracted features via Framework Components and standards • Multiple language support • Simplified design • $ • <1 month • Highly flexible
  • 14. Data Profiling Put all the data in the lake man What’s in these data sets? More data is better. Work with the population and not a sample -- Data Scientists
  • 15. Why Data Profiling? • Find out what is in the data • Get metrics on data quality • Assess the risk involved in creating business rules • Discover metadata including value patterns and distributions • Understand data challenges early to avoid delays and cost overruns • Improve the ability to search the data
  • 16. Business Challenge • Analysts spend 80-90% of their time on data munging • Current approaches require multiple manual touch points and processes • Lost opportunity due to lengthy project time frames
  • 17. Typical Scenario • Data size too large to view using Excel and Notepad • Data has to be loaded into a database for profiling • Cannot load into a database unless the data fields are known • File formats are not right and specifications are incorrect • Distribution, space, multiple touch points, moving files here and there • Too many dependencies, wasted time
  • 18. What do we need? Speed, Agility & Automation
  • 19. Data Profiler Requirements • Profile data from data lake • Validate and preview data • Review statistics • Create metadata • Create downstream schema
  • 20. Spark to the rescue • Check the types • Check the values • Calculate metrics • Generate metadata • Dynamically built execution graph • Map -> Map • Built-in transformations (unique, get first, etc.) • In-memory execution provides speed (diagram labels: Data Files, RDDs, columns C1 C2 C3 C4 … Cn)
  • 21. Execution Flow and Software Stack (diagram with numbered steps 1 to 8; labels: Data Profiler UI, WEB Server, Spark Data Profiler, Meta Data Repository, Data Lake location for data, NFS, Spark, Spark Monitoring UI, MapR UI, MapR FS (M7), MapR Cluster; legend: Hardware Infrastructure Level, OS/File System Level, System Application Level, Razorsight Application Level)
  • 22. Univariate Statistics • Outputs for Numeric Values • Outputs for Non-Numeric Values • Histograms • Count of Missing Values • Count of Non-Missing Values • Mean • Variance • Standard Deviation • Minimum • Maximum • Range • Mode • Median • Q1 Value • Q3 Value • Interquartile Range • Skewness • Kurtosis
  • 23. Data Profiler Web Application
  • 25. Advantages • Source data in data lake • All profiling done in the data lake • No manual movement of data • Profile sample or full data set • Integrate creation of meta data for transformation, enrichment • Send clean data to downstream processes
  • 26. Results • Improved data analysis time from weeks to hours • Average improvement of the data pipeline process: 80% • Identified data quality issues well ahead of time • Empowered business analysts to perform the work
  • 27. Framework Layers • Layer 1 Infrastructure • Layer 2 Data Management • Layer 3 Modeling • Layer 4 Integration • Layer 5 Business Insight and Actions • Data Ingestion, Data Lake, Data Preparation, Data Analytics, Data Services, Data Visualization • Secure Repository • Data Health | Cleansing | Pruning | Transformation | Univariate Analysis • Descriptive | Predictive | Bivariate | Multivariate • RESTful | SOA • Dashboards | Ad hoc Queries | KPIs | Alerts • Structured | Unstructured | Batch | Streaming • SFTP, NDM, Nwk Path, Social Media, Stream, Email
  • 28. Framework Components • Ingestion: Multiple source channels, Batch/Real Time, Data Validation, Compression/Encryption • Profiling: Data Health Check, Summary Statistics, Scrubbing/Cleansing, Meta Data Creation • Parsing: Fixed Width, Delimited, Mapping • Transformation: Enrichment, Truncation, Imputation, Aggregation • Integration: Batch, RESTful, Database • Web Portal: Meta Data Configuration, Tracking, Alerts, Dashboard
  • 29. Framework Architecture (diagram labels: Processing Components, Data Storage Layer, Data Aggregator, Data Parser & Transformer, Elastic Search Loader, DB Loader, Data Reconciliation, Orchestration Layer, Elastic Search, XDF Web UI, Data Profiler, MySQL Meta-data Repository, Data Partitioner, Synchronoss Data Lake, Data Ingestion, Data Beacon, External Data Sources, Bivariate Engine, Data Prep Engine, SQL Engine; legend: Control Flow, Data Flow)
  • 30. Framework Technology Stack (diagram labels: MapR FS (M7), Sqoop, Apache Spark, Hadoop, MapR Cluster, NFS, UI/Control Cluster, Oozie, Apache Drill, Tomcat, ActiveMQ, Spring Integration, HUE, ElasticSearch Cluster, ElasticSearch Engine, Angular, REST, Unix/Linux; legend: Hardware Infrastructure Level, OS/File System Level, System Application Level)
  • 31. What’s Next? • Bivariate Analysis • Multicollinearity • Outputs for Numeric Values (by target value for each variable): Record Count, Row Count Percent, Average, Variance, Standard Deviation, Skewness, Kurtosis, Minimum, Maximum • Correlation Outputs: Pearson’s Correlation Coefficient, Spearman’s Correlation Coefficient, Covariance • Variable Clustering, Regression Coefficients, Dendrogram, Hierarchical Cluster (HCA), Correlation Matrix, Variance Inflation Factor (VIF)
  • 32. Lessons • Let business value drive technology adoption • Plan incremental updates • Pay attention to hidden costs • Simplify • Implement Framework based development • Leverage existing skillset to scale
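
The profiling steps named in slides 20 and 22 (check types, count missing values, compute univariate statistics per column) can be sketched in plain Python. This is an illustration only, not the profiler described in the talk: it uses the standard library instead of Spark so it runs without a cluster, and the names `MISSING`, `infer_type`, `profile_column`, and `profile_rows`, as well as the choice of missing-value markers, are assumptions made for the example.

```python
import math
import statistics
from collections import Counter

# Assumed markers for a missing value; real feeds may use others.
MISSING = {"", "NA", "NULL"}


def infer_type(values):
    """Return 'numeric' if every non-missing value parses as a float."""
    present = [v for v in values if v not in MISSING]
    if not present:
        return "empty"
    try:
        for v in present:
            float(v)
    except ValueError:
        return "text"
    return "numeric"


def profile_column(values):
    """Compute counts and, for numeric columns, univariate statistics."""
    present = [v for v in values if v not in MISSING]
    profile = {
        "type": infer_type(values),
        "missing": len(values) - len(present),
        "non_missing": len(present),
    }
    if present:
        # Most frequent raw value; ties resolve to the first one seen.
        profile["mode"] = Counter(present).most_common(1)[0][0]
    if profile["type"] == "numeric":
        xs = [float(v) for v in present]
        mean = statistics.fmean(xs)
        variance = statistics.pvariance(xs)  # population variance
        std_dev = math.sqrt(variance)
        profile.update(mean=mean, variance=variance, std_dev=std_dev,
                       minimum=min(xs), maximum=max(xs),
                       range=max(xs) - min(xs))
        if len(xs) >= 2:
            q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
            profile.update(q1=q1, median=median, q3=q3, iqr=q3 - q1)
        if std_dev > 0:
            n = len(xs)
            # Moment-based skewness and excess kurtosis.
            profile["skewness"] = sum((x - mean) ** 3 for x in xs) / (n * std_dev ** 3)
            profile["kurtosis"] = sum((x - mean) ** 4 for x in xs) / (n * std_dev ** 4) - 3
    return profile


def profile_rows(header, rows):
    """Transpose rows into columns (C1..Cn) and profile each column."""
    columns = zip(*rows)
    return {name: profile_column(list(col)) for name, col in zip(header, columns)}


# Hypothetical sample data with one missing amount.
header = ["id", "amount", "state"]
rows = [["1", "10", "VA"], ["2", "20", "NY"], ["3", "", "VA"],
        ["4", "30", "VA"], ["5", "40", "NY"]]
report = profile_rows(header, rows)
print(report["amount"]["mean"], report["amount"]["missing"], report["state"]["mode"])
```

Each column is profiled independently of the others, which is what makes this kind of work easy to distribute; here the columns are simply processed one after another.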

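Slide 31 lists Pearson's and Spearman's correlation coefficients and covariance among the planned bivariate outputs. A minimal sketch of those three measures follows, again in standard-library Python rather than Spark; the function names `pearson`, `ranks`, `spearman`, and `covariance` are hypothetical, and Spearman is computed here as Pearson on average ranks.

```python
import math


def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return num / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical data: a linear and a monotonic but non-linear relationship.
xs = [1, 2, 3, 4, 5]
linear = [2, 4, 6, 8, 10]
squared = [1, 4, 9, 16, 25]
print(pearson(xs, linear), pearson(xs, squared), spearman(xs, squared))
```

The squared series shows why both coefficients are worth reporting: the relationship is perfectly monotonic, so the rank-based measure reaches 1 while Pearson's coefficient stays below it.
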
Editor's Notes

  1. a. How can data be used? b. Does the data conform to the structure, standards or patterns? c. Challenges of joins and integration d. Identify key candidates, foreign-key candidates, and functional dependencies e. Identify enrichment rules for better search by assigning it to a category
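
The note about key candidates and functional dependencies can be illustrated with a small sketch. It assumes rows are lists of strings with the empty string marking a missing value; `key_candidates` and `functional_dependency_holds` are hypothetical names, and the check only covers single columns, not composite keys.

```python
def key_candidates(header, rows):
    """Columns whose values are all present and all distinct."""
    candidates = []
    for index, name in enumerate(header):
        column = [row[index] for row in rows]
        if all(value != "" for value in column) and len(set(column)) == len(column):
            candidates.append(name)
    return candidates


def functional_dependency_holds(rows, determinant, dependent):
    """True if each value at index `determinant` maps to one value at `dependent`."""
    seen = {}
    for row in rows:
        # setdefault stores the first dependent value seen for this determinant.
        if seen.setdefault(row[determinant], row[dependent]) != row[dependent]:
            return False
    return True


# Hypothetical sample: id is unique, zip determines city.
header = ["id", "zip", "city"]
rows = [["1", "20190", "Reston"],
        ["2", "20190", "Reston"],
        ["3", "10001", "New York"]]
print(key_candidates(header, rows))
print(functional_dependency_holds(rows, 1, 2))
print(functional_dependency_holds(rows, 2, 0))
```

A result from sample data is only a candidate: a column that is unique in the rows profiled may still contain duplicates in the full data set.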