Spark Summit Keynote by Suren Nathan

Data Profiling and Pipeline Processing with Spark – A Journey
Suren Nathan
Synchronoss

(Q3’2014 revenue)
Who am I
• Sr. Director, Big Data Platform and Analytics Framework at Synchronoss
• CTO at Razorsight (acquired by Synchronoss)
• Worked in analytics and decision support systems for more than 15 years
• Passionate about solving business problems by leveraging the latest technology

Synchronoss provides Personal Cloud and Activation Platforms to Tier One Operators, MSOs and Enterprises around the globe

Synchronoss Integrated Cloud Products
• Mobile Content Transfer
• Personal Cloud
• Device Activation
• Cloud Account Provisioning
• On-Boarded Welcome

Synchronoss Connects Operators to their Customers
• Online and Device ACTIVATION
• Back-up, Sync and Share
• ACTIVATION, CLOUD, Internet of Things
• Integrated Life

Big Data @ Synchronoss
Sample numbers @ one tier1 customer:
• 30M registered users
• 14M monthly active users
• 8M daily active users
• Up to 215TB of ingest per day
• 62PB of content stored
• 50 Billion user content files
• Ingest of 1PB per week
• 4+ Star Rating Apps

What do we do?
• Big Data Analytics Platform Group
• Implement scalable big data technology platform to help deliver consistent analytics
• Platform deployed in private cloud and AWS

Data Pipeline Process
• Ingest Data
• Profile Data
• Parse Data
• Transform Data
• Enrich Data
• Aggregate Data
- Perform Analysis
- Load Index Store
- Feed EDW
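
The stages above can be read as a chain of functions, each consuming the output of the previous one. The following is a minimal sketch in plain Python (not Spark), assuming a toy comma-delimited record layout of `user,bytes`. All function names, the record layout and the "large upload" rule are hypothetical and do not come from the talk, and the profiling stage is left out for brevity.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical stage functions; each takes the previous stage's output.
def ingest(lines):
    # Drop blank lines on the way in.
    return [ln.strip() for ln in lines if ln.strip()]

def parse(lines):
    # Split "user,bytes" rows into dicts of raw strings.
    return [dict(zip(("user", "bytes"), ln.split(","))) for ln in lines]

def transform(rows):
    # Cast the numeric field.
    return [{**r, "bytes": int(r["bytes"])} for r in rows]

def enrich(rows):
    # Add a derived field (assumed rule: 1000 bytes or more is "large").
    return [{**r, "large": r["bytes"] >= 1000} for r in rows]

def aggregate(rows):
    # Total bytes per user.
    totals = defaultdict(int)
    for r in rows:
        totals[r["user"]] += r["bytes"]
    return dict(totals)

def run_pipeline(data, stages):
    # Feed the data through each stage in order.
    return reduce(lambda acc, stage: stage(acc), stages, data)

STAGES = [ingest, parse, transform, enrich, aggregate]
result = run_pipeline(["a,500", "", "b,1500", "a,700"], STAGES)
print(result)
```

Keeping every stage as a function with the same calling shape is one way to make stages reusable and reorderable; the aggregated result could then feed analysis, an index store, or an EDW load.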

Data Pipeline – V1
[Diagram: Source Data, Staging, ETL, EDW; Process Centric ETL]
• Multiple custom ETLs separated from the data layer
• SMP architecture, not distributed
• Long-running batch workloads
• Contention and bottlenecks with increased data volume
• No support for unstructured data
• Cannot retain historical data
$$$
>1 YEAR
Inflexible

Data Pipeline – V2
[Diagram: Source Data, Staging, ETL, EDW; Process Centric ETL; MPP Appliance]
• ETLs closer to data
• High performance, but expensive
• Batch workloads, with reduced latencies
• Unable to handle unstructured data
• Storage costs prohibitive
$$$$
6 Months+
Still Inflexible

Data Pipeline – V3 Option Skipped
[Diagram: Source Data]
• Did not foresee a huge improvement
• Batch workloads only
• Slow performance with MapReduce
• Lack of resources and skills gap
• Lack of consistency
• Too many tools
$$
1 year+
Risks

Data Pipeline – V4
[Diagram: Source Data]
• ETLs closer to data
• Batch and stream workloads
• Superior performance
• Abstracted features via Framework
• Components and standards
• Multiple language support
• Simplified design
$
<1 Month
Highly Flexible

Data Profiling
"Put all the data in the lake man"
"What’s in these data sets?"
"More data is better. Work with the population and not a sample" -- Data Scientists

Why Data Profiling?
• Find out what is in the data
• Get metrics on data quality
• Assess the risk involved in creating business rules
• Discover metadata including value patterns and distributions
• Understand data challenges early to avoid delays and cost overruns
• Improve the ability to search the data

Business Challenge
• Analysts spend 80-90% of their time on data munging
• Current approaches require multiple manual touch points and processes
• Lost opportunity due to lengthy project time frames

Typical Scenario
Data size too large to view using Excel and Notepad
Data has to be loaded into a database for profiling
Cannot load into a database unless the data fields are known
File formats are not right and specifications are incorrect
Distribution, space, multiple touch points, moving files here and there
Too many dependencies, wasted time

What do we need?
Speed, Agility & Automation

Data Profiler Requirements
Profile data from data lake
Validate and Preview data
Review statistics
Create Meta Data
Create Downstream Schema

Spark to the rescue
• Check the Types
• Check the Values
• Calculate metrics
• Generate MetaData
[Diagram: Data Files, RDDs, columns C1 C2 C3 C4 … Cn]
• Dynamically built execution graph
• Map -> Map
• Built-in transformations (unique, get first, etc.)
• In-memory execution provides speed
Execution Flow and Software Stack
[Diagram: execution flow with numbered steps 1 to 8 (including 6.1, 6.2, 6.3)]
Flow components: Data Profiler UI, Spark Data Profiler, Repository, Data Lake location for data
Stack components: MapR Cluster, MapR FS (M7), NFS, Spark, Spark Monitoring UI, MapR UI, Spark Data Profiler, Meta Data Repository, WEB Server, Data Profiler UI
Legend: Hardware Infrastructure Level, OS/File System Level, System Application Level, Razorsight Application Level

Univariate Statistics
Outputs for Numeric Values and Outputs for Non-Numeric Values:
• Histograms
• Count of Missing Values
• Count of Non-Missing Values
• Mean
• Variance
• Standard Deviation
• Minimum
• Maximum
• Range
• Mode
• Median
• Q1 Value
• Q3 Value
• Interquartile Range
• Skewness
• Kurtosis
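
Most of the numeric outputs listed above can be computed with Python's standard `statistics` module. The following is a minimal sketch, not the profiler's actual implementation. It assumes `None` marks a missing value, uses population variance, interpolated ("inclusive") quartiles, and moment-based skewness and excess kurtosis; the talk does not say which definitions the profiler uses.

```python
import statistics

def univariate(values):
    """Summary statistics for one numeric column; None marks a missing value.

    Needs at least two non-missing values (required by the quartile step).
    """
    present = [v for v in values if v is not None]
    n = len(present)
    mean = statistics.fmean(present)
    variance = statistics.pvariance(present, mu=mean)  # population variance
    std = variance ** 0.5
    # Quartiles by linear interpolation over the observed range.
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    # Moment-based skewness and excess kurtosis (undefined when std is 0).
    if std > 0:
        skew = sum((v - mean) ** 3 for v in present) / n / std ** 3
        kurt = sum((v - mean) ** 4 for v in present) / n / std ** 4 - 3
    else:
        skew = kurt = None
    return {
        "missing": len(values) - n,
        "non_missing": n,
        "mean": mean,
        "variance": variance,
        "std": std,
        "min": min(present),
        "max": max(present),
        "range": max(present) - min(present),
        "mode": statistics.mode(present),
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "skewness": skew,
        "kurtosis": kurt,
    }

column = [2, 4, None, 4, 4, 5, 5, None, 7, 9]
stats = univariate(column)
print(stats)
```

On a full data set these values would normally come from distributed aggregations rather than an in-memory list, but the definitions stay the same.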

Advantages
• Source data in data lake
• All profiling done in the data lake
• No manual movement of data
• Profile sample or full data set
• Integrate creation of meta data for transformation and enrichment
• Send clean data to downstream processes

Results
• Reduced data analysis time from weeks to hours
• Average improvement of the data pipeline process: 80%
• Identified data quality issues well ahead of time
• Empowered business analysts to perform the work

Framework Components
Ingestion
• Multiple source channels
• Batch/Real Time
• Data Validation
• Compression/Encryption
Profiling
• Data Health Check
• Summary Statistics
• Scrubbing/Cleansing
• Meta Data Creation
Parsing
• Fixed Width
• Delimited
• Mapping
Transformation
• Enrichment
• Truncation
• Imputation
• Aggregation
Integration
• Batch
• RESTful
• Database
Web Portal
• Meta Data Configuration
• Tracking
• Alerts
• Dashboard
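
The fixed-width and delimited parsing options and the imputation step named above can be shown on a toy record. The following is a minimal sketch in plain Python, not the framework's actual code. The layout, field names, pipe delimiter and the choice of mean imputation are assumptions made for illustration.

```python
def parse_fixed_width(line, layout):
    """layout: list of (name, start, end) with 0-based, end-exclusive offsets."""
    return {name: line[start:end].strip() for name, start, end in layout}

def parse_delimited(line, names, delimiter="|"):
    """Split a delimited line and pair the fields with column names."""
    return dict(zip(names, (f.strip() for f in line.split(delimiter))))

def impute_mean(rows, field):
    """Replace empty values of `field` with the mean of the non-empty ones."""
    present = [float(r[field]) for r in rows if r[field] != ""]
    mean = sum(present) / len(present)
    return [
        {**r, field: float(r[field]) if r[field] != "" else mean}
        for r in rows
    ]

# Hypothetical layout: id (4 chars), name (8 chars), amount (6 chars).
LAYOUT = [("id", 0, 4), ("name", 4, 12), ("amount", 12, 18)]

# Build sample lines with format specs so the column widths are exact.
lines = [
    f"{'0001':<4}{'Alice':<8}{'10.5':>6}",
    f"{'0002':<4}{'Bob':<8}{'':>6}",
    f"{'0003':<4}{'Carol':<8}{'4.5':>6}",
]

rows = [parse_fixed_width(ln, LAYOUT) for ln in lines]
imputed = impute_mean(rows, "amount")
print(imputed)
```

Driving the parser from a layout description rather than hard-coded offsets matches the idea of configuring parsing through metadata: the same function handles any file whose layout is supplied.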

Framework Architecture
[Diagram; labels recovered from the slide]
• Processing Components
• Data Storage Layer
• Orchestration Layer
• Data Aggregator
• Data Parser & Transformer
• Elastic Search Loader
• DB Loader
• Data Reconciliation
• Elastic Search
• XDF Web UI
• Data Profiler
• MySQL
• Meta-data Repository
• Data Partitioner
• Synchronoss Data Lake
• Data Ingestion
• Data Beacon
• External Data Sources
• Bivariate Engine
• Data Prep Engine
• SQL Engine
Legend: Control Flow, Data Flow

Framework Technology Stack
[Diagram; labels recovered from the slide]
Clusters: MapR Cluster, UI/Control Cluster, ElasticSearch Cluster
Components: MapR FS (M7), Sqoop, Apache Spark, Hadoop, NFS, Oozie, Apache Drill, Tomcat, ActiveMQ, Spring Integration, HUE, ElasticSearch Engine, Angular, REST
Operating system: Unix/Linux
Legend: Hardware Infrastructure Level, OS/File System Level, System Application Level

What’s Next?
• Bivariate Analysis
Outputs for Numeric Values (by target value for each variable): Record Count, Row Count Percent, Average, Variance, Standard Deviation, Skewness, Kurtosis, Minimum, Maximum
Correlation Outputs: Pearson’s Correlation Coefficient, Spearman’s Correlation Coefficient, Covariance
• Multicollinearity
Variable Clustering, Regression Coefficients, Dendrogram, Hierarchical Cluster (HCA), Correlation Matrix, Variance Inflation Factor (VIF)
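
The correlation outputs named above have standard textbook definitions. The following is a minimal sketch in plain Python, not the planned engine. It assumes sample covariance (n - 1 denominator), average ranks for ties in Spearman's coefficient, and a VIF helper restricted to the two-predictor case, where R² equals the squared Pearson coefficient. All function names are hypothetical.

```python
import math

def covariance(x, y):
    """Sample covariance (n - 1 denominator)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def pearson(x, y):
    """Pearson's correlation coefficient."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's coefficient: Pearson's coefficient of the ranks."""
    return pearson(ranks(x), ranks(y))

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with a single other predictor, R^2 = r^2."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(covariance(x, y), pearson(x, y), spearman(x, y), vif_two_predictors(x, y))
```

In this example `y` is a monotonic but nonlinear function of `x`, so Spearman's coefficient is 1 while Pearson's is slightly below 1. With more than two predictors, VIF requires regressing each variable on all the others, which this sketch does not cover.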

Lessons
• Let business value drive technology adoption
• Plan incremental updates
• Pay attention to hidden costs
• Simplify
• Implement framework-based development
• Leverage existing skillset to scale

THANK YOU.
Suren.nathan@Synchronoss.com
