Dataworks | 2018-06-20 | Gimel data platform

Agenda
©2018 PayPal Inc. Confidential and proprietary.
• Introduction
• PayPal’s Analytics Ecosystem
• Why Gimel
• Challenges in Analytics
• Walk through simple use case
• Gimel Open Source Journey

About Us
Romit Mehta
• Product manager, data processing products at PayPal
• 20 years in data and analytics across networking, semiconductors, telecom, security and fintech industries
• Data warehouse developer, BI program manager, data product manager
romehta@paypal.com
https://www.linkedin.com/in/romit-mehta/
Deepak Mohanakumar Chandramouli
• Big data platform engineer at PayPal
• 13 years in data engineering, 5 years in scalable solutions with big data
• Developed several Spark-based solutions across NoSQL, key-value, messaging, document-based and relational systems
dmohanakumarchan@paypal.com
https://www.linkedin.com/in/deepakmc/

PayPal – Key Metrics and Analytics Ecosystem

PayPal Big Data Platform
• 160+ PB data
• 75,000+ YARN jobs/day
• One of the largest Aerospike, Teradata, Hortonworks and Oracle installations
• Compute supported: MR, Pig, Hive, Spark, Beam
• 13 prod clusters, 12 non-prod clusters
• GPU co-located with Hadoop

[Diagram: platform layers; labels recovered from the figure]
Users: Developer, Data scientist, Analyst, Operator
Layers: User Experience and Access; Gimel Data Platform; Compute Framework and APIs; Application Lifecycle Management
Components: Gimel SDK, Notebooks, R Studio, BI tools, PCatalog, Data API
Also shown: Logging, Monitoring, Alerting, Security
Infrastructure services leveraged for elasticity and redundancy: Multi-DC, Public cloud, Predictive resource allocation

Use case challenges
[Diagram: flights use case pipeline; labels recovered from the figure]
Data Sources: Kafka, Teradata, External
Data Points: Flights Events, Airports, Airlines, Carrier, Geography & Geo Tags
Data Prep / Availability: Stream Ingest, Load, Extract/Load, Process
Storage: HDFS / Hive (Parquet/ORC/Text?)
Analysis and Publish: Real-time/processed data
Productionalize, Logging, Monitoring, Alerting, Auditing, Data Quality
…

Data Access Code is Cumbersome and Fragile
• Spark Read From HBase
• Spark Read From Elasticsearch
• Spark Read From Aerospike
• Spark Read From Druid

Datasets Challenges
• Data access tied to compute and data store versions
• Hard to find available data sets
• Storage-specific dataset creation results in duplication and increased latency
• No audit trail for dataset access
• No standards for on-boarding data sets for others to discover
• No statistics on data set usage and access trends

High-friction Data Application Lifecycle
Event | Steps required
Onboarding Big Data Apps | Learn, Code, Optimize, Build, Deploy, Run
Compute Engine Changed | Learn, Code, Optimize, Build, Deploy, Run
Compute Version Upgraded | Learn, Code, Optimize, Build, Deploy, Run
Storage API Changed | Learn, Code, Optimize, Build, Deploy, Run
Storage Connector Upgraded | Learn, Code, Optimize, Build, Deploy, Run
Storage Hosts Migrated | Learn, Code, Optimize, Build, Deploy, Run
Storage Changed | Learn, Code, Optimize, Build, Deploy, Run
********************* | Learn, Code, Optimize, Build, Deploy, Run

Use case challenges - Simplified with Gimel
[Diagram: the flights use case pipeline with Gimel & Notebooks; labels recovered from the figure]
With Gimel & Notebooks: API, PCatalog, Tools
Data Sources: Kafka, Teradata, External
Data Points: Flights Events, Airports, Airlines, Carrier, Geography & Geo Tags
Data Prep / Availability: Ingest, Load, Extract/Load, Process
Storage: HDFS / Hive (Parquet/ORC/Text?)
Analysis, Publish
Productionalize, Logging, Monitoring, Alerting, Auditing, Data QC

With Data API
✔
Data Access Simplified with Gimel Data API

With Data API
✔
SQL Support in Gimel Data Platform

Data Application Lifecycle with Data API
Event | Steps required
Onboarding Big Data Apps | Learn, Code, Optimize, Build, Deploy, Run
Compute Engine Changed | Run
Compute Version Upgraded | Run
Storage API Changed | Run
Storage Connector Upgraded | Run
Storage Hosts Migrated | Run
Storage Changed | Run
********************* | Run

Open Source

Gimel Open Source Journey
• Open source Gimel PCatalog:
  • Metadata services
  • Discovery services
  • Catalog UI
• Open source Compute Framework (SCaaS)
  • Livy features and enhancements
  • Monitoring and alerting
  • SDK and Gimel integration
• Open source PayPal Notebooks
  • Jupyter features and enhancements
  • Gimel integration
• Open sourced Gimel Data API in April 2018 (http://try.gimel.io)

Gimel - Open Sourced
Codebase available: https://github.com/paypal/gimel
Slack: https://gimel-dev.slack.com
Google Groups: https://groups.google.com/d/forum/gimel-dev

Q&A
Github: http://gimel.io
Try it yourself: http://try.gimel.io
Slack: https://gimel-dev.slack.com
Google Groups: https://groups.google.com/d/forum/gimel-dev

PayPal Analytics Ecosystem
[Diagram; labels recovered from the figure]
Compute-side labels: Job, Livy Grid, Job Server, Livy API, NAS, Batch, Interactive, Sparkling Water, Metrics, History Server, Thrift Server
Catalog-side labels: xDiscovery, Scan, Discover, Maintain Catalog, Metadata Services, PCatalog UI, Explore, Configure
Log labels: Log, Indexing, Search

A peek into
Streaming SQL
Launches … Spark Streaming App
-- Streaming window, in seconds
set gimel.kafka.throttle.streaming.window.seconds=10;
-- Throttling
set gimel.kafka.throttle.streaming.maxRatePerPartition=1500;
-- ZK checkpoint root path
set gimel.kafka.consumer.checkpoint.root=/checkpoints/appname;
-- Checkpoint enabling flag - implicitly checkpoints after each mini-batch in streaming
set gimel.kafka.reader.checkpoint.save.enabled=true;
-- Jupyter magic for streaming SQL on Notebooks | Interactive use cases
-- Livy REPL - same magic for streaming SQL works | Streaming use cases
%%gimel-stream
-- Assume a pre-split HBase table as an example
insert into pcatalog.HBASE_dataset
select
  cust_id,
  kafka_ds.*
from pcatalog.KAFKA_dataset kafka_ds;
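As a rough sizing aid for the two throttle settings above, the arithmetic can be sketched as follows. This assumes `maxRatePerPartition` means records per second per partition, as Spark's own `spark.streaming.kafka.maxRatePerPartition` does; the slides do not state Gimel's exact semantics. The function name and the partition count of 8 are illustrative only.

```python
def max_records_per_batch(max_rate_per_partition, window_seconds, partitions):
    """Upper bound on records pulled in one streaming window.

    Assumption: the rate is expressed in records/second/partition.
    """
    return max_rate_per_partition * window_seconds * partitions


# Values from the slide: 1500 records/s/partition, 10-second window.
per_partition = max_records_per_batch(1500, 10, 1)
# Hypothetical topic with 8 partitions.
per_topic = max_records_per_batch(1500, 10, 8)
print(per_partition, per_topic)  # 15000 120000
```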
Batch SQL
Launches … Spark Batch App
-- Establish 10 concurrent connections per topic-partition
set gimel.kafka.throttle.batch.parallelsPerPartition=10;
-- Fetch at most 10 M messages from each partition
set gimel.kafka.throttle.batch.maxRecordsPerPartition=10000000;
-- Jupyter magic on Notebooks | Interactive use cases
-- Livy REPL - same magic works | Batch use cases
%%gimel
insert into pcatalog.HIVE_dataset
partition(yyyy,mm,dd,hh,mi)
select kafka_ds.*, gimel_load_id
,substr(commit_timestamp,1,4) as yyyy
,substr(commit_timestamp,6,2) as mm
,substr(commit_timestamp,9,2) as dd
,substr(commit_timestamp,12,2) as hh
,case when cast(substr(commit_timestamp,15,2) as INT) <= 30 then "00" else "30" end as mi
from pcatalog.KAFKA_dataset kafka_ds;
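The partition columns in the batch query are plain substring arithmetic on `commit_timestamp`. The sketch below mirrors that logic so the resulting partition values are easy to see. It assumes the timestamp is formatted as `yyyy-MM-dd HH:mm:ss`, which the substring offsets suggest but the slides do not state; `partition_columns` is an illustrative name.

```python
def partition_columns(commit_timestamp):
    """Mirror the SQL substr() calls (1-based in SQL, 0-based slices here)."""
    minute = int(commit_timestamp[14:16])  # substr(commit_timestamp, 15, 2)
    return {
        "yyyy": commit_timestamp[0:4],     # substr(commit_timestamp, 1, 4)
        "mm": commit_timestamp[5:7],       # substr(commit_timestamp, 6, 2)
        "dd": commit_timestamp[8:10],      # substr(commit_timestamp, 9, 2)
        "hh": commit_timestamp[11:13],     # substr(commit_timestamp, 12, 2)
        # Same rule as the CASE expression: minutes 00-30 land in "00".
        "mi": "00" if minute <= 30 else "30",
    }


print(partition_columns("2018-06-20 14:47:05"))
# {'yyyy': '2018', 'mm': '06', 'dd': '20', 'hh': '14', 'mi': '30'}
```

Note that, as written in the query, minute 30 itself falls into the "00" bucket because the comparison is `<= 30`.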
Following are Jupyter/Livy magic terms
• %%gimel : calls gimel.executeBatch(sql)
• %%gimel-stream : calls gimel.executeStream(sql)
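The mapping from magic to entry point can be sketched as a small dispatcher. This is an illustration, not the notebook implementation: `execute_batch` and `execute_stream` are hypothetical stand-ins for `gimel.executeBatch` and `gimel.executeStream`, and the regular expressions are one possible way to tell the two magics apart.

```python
import re


# Hypothetical stand-ins for gimel.executeBatch / gimel.executeStream.
def execute_batch(sql):
    return ("batch", sql)


def execute_stream(sql):
    return ("stream", sql)


# Whitespace is required after the magic name so that "%%gimel-stream"
# is never swallowed by the shorter "%%gimel" pattern.
STREAM_MAGIC = re.compile(r"%%gimel-stream\s+(.*)", re.IGNORECASE | re.DOTALL)
BATCH_MAGIC = re.compile(r"%%gimel\s+(.*)", re.IGNORECASE | re.DOTALL)


def dispatch(cell):
    """Route a notebook cell to the batch or streaming entry point."""
    cell = cell.strip()
    match = STREAM_MAGIC.fullmatch(cell)
    if match:
        return execute_stream(match.group(1))
    match = BATCH_MAGIC.fullmatch(cell)
    if match:
        return execute_batch(match.group(1))
    raise ValueError("not a Gimel magic cell")


print(dispatch("%%gimel\nselect * from pcatalog.HIVE_dataset"))
print(dispatch("%%gimel-stream\nselect * from pcatalog.KAFKA_dataset"))
```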

Anatomy of API
gimel.dataset.factory {
  KafkaDataSet
  ElasticSearchDataSet
  DruidDataSet
  HiveDataSet
  AerospikeDataSet
  HbaseDataSet
  CassandraDataSet
  JDBCDataSet
}
gimel.datastream.factory {
  KafkaDataStream
}
User-facing calls:
dataSet.read("dataSetName", options)
dataSet.write(dataToWrite, "dataSetName", options)
dataStream.read("dataSetName", options)
Metadata Services:
CatalogProvider.getDataSetProperties("dataSetName")
Factory lookup:
val storageDataSet = getFromFactory(type = "Hive")
val storageDataStream = getFromStreamFactory(type = "kafka")
Storage-specific calls:
kafkaDataSet.read("dataSetName", options)
hiveDataSet.write(dataToWrite, "dataSetName", options)
storageDataStream.read("dataSetName", options)
{
  Core Connector Implementation, example – Kafka
  Combination of open source connectors and in-house implementations
  Open source connectors such as DataStax / SHC / ES-Spark
}
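The flow on this slide (look up dataset properties in the catalog, pick a storage-specific implementation from a factory, then delegate the read or write) can be sketched in a few lines. Everything here is a toy: `InMemoryDataSet`, `CATALOG` and `FACTORY` are hypothetical names standing in for the real connectors, the catalog provider and the dataset factory, and the argument order of `write` is a choice made for the sketch.

```python
class InMemoryDataSet:
    """Toy connector standing in for KafkaDataSet, HiveDataSet, and so on."""

    def __init__(self, storage_type):
        self.storage_type = storage_type
        self.rows = {}

    def read(self, name, options):
        return list(self.rows.get(name, []))

    def write(self, name, data, options):
        self.rows.setdefault(name, []).extend(data)
        return data


# Hypothetical catalog: dataset name -> properties (the role of CatalogProvider).
CATALOG = {
    "pcatalog.KAFKA_dataset": {"gimel.storage.type": "kafka"},
    "pcatalog.HIVE_dataset": {"gimel.storage.type": "hive"},
}

# Hypothetical factory: storage type -> connector (the role of getFromFactory).
FACTORY = {t: InMemoryDataSet(t) for t in ("kafka", "hive")}


class DataSet:
    """Single entry point; callers never name the storage system."""

    def _resolve(self, name):
        props = CATALOG[name]
        return FACTORY[props["gimel.storage.type"]]

    def read(self, name, options=None):
        return self._resolve(name).read(name, options or {})

    def write(self, name, data, options=None):
        return self._resolve(name).write(name, data, options or {})


data_set = DataSet()
data_set.write("pcatalog.HIVE_dataset", [{"id": 1}])
print(data_set.read("pcatalog.HIVE_dataset"))  # [{'id': 1}]
```

The point of the indirection is that moving a dataset to a different store changes only its catalog entry, not the calling code.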
How a %%gimel statement is resolved
%%gimel
-- Establish 10 concurrent connections per topic-partition
set gimel.kafka.throttle.batch.parallelsPerPartition=10;
-- Fetch at most 10 M messages from each partition
set gimel.kafka.throttle.batch.maxRecordsPerPartition=10000000;
insert into pcatalog.HIVE_dataset
partition(yyyy,mm,dd,hh,mi)
select kafka_ds.*, gimel_load_id
,substr(commit_timestamp,1,4) as yyyy
,substr(commit_timestamp,6,2) as mm
,substr(commit_timestamp,9,2) as dd
,substr(commit_timestamp,12,2) as hh
,case when cast(substr(commit_timestamp,15,2) as INT) <= 30 then "00" else "30" end as mi
from pcatalog.KAFKA_dataset kafka_ds
join default.geo_lkp lkp
on kafka_ds.zip = lkp.zip
where lkp.region = 'MIDWEST'
Resolution steps:
val dataSet: gimel.DataSet = DataSet(sparkSession)
val df1 = dataSet.read("pcatalog.KAFKA_dataset", options)
df1.createGlobalTempView("tmp_abc123")
val resolvedSelectSQL = selectSQL.replace("pcatalog.KAFKA_dataset", "tmp_abc123")
val readDf: DataFrame = sparkSession.sql(resolvedSelectSQL)
dataSet.write("pcatalog.HIVE_dataset", readDf, options)
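The name-substitution step in this flow, where each `pcatalog.*` reference in the SELECT is swapped for a temporary view, can be sketched without Spark. This is a simplified illustration: `resolve_sql` and the `tmp_N` naming are hypothetical, and the `read_dataset` callback stands in for reading the dataset through the Data API and registering the view.

```python
import re


def resolve_sql(select_sql, read_dataset):
    """Replace every pcatalog.<name> reference with a temp view name.

    read_dataset(dataset_name, view_name) is called once per distinct
    dataset; in a real job it would load the data and register the view.
    """
    views = {}

    def swap(match):
        name = match.group(0)
        if name not in views:
            views[name] = "tmp_{}".format(len(views))
            read_dataset(name, views[name])
        return views[name]

    resolved = re.sub(r"\bpcatalog\.\w+", swap, select_sql, flags=re.IGNORECASE)
    return resolved, views


registered = []
sql = "select kafka_ds.* from pcatalog.KAFKA_dataset kafka_ds"
resolved, views = resolve_sql(sql, lambda name, view: registered.append((name, view)))
print(resolved)  # select kafka_ds.* from tmp_0 kafka_ds
```

Tables outside the catalog namespace, such as `default.geo_lkp` in the query above, are left untouched and resolved by the SQL engine as usual.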

Catalog Provider – USER | HIVE | PCATALOG | Your Own Catalog

set gimel.catalog.provider=PCATALOG
Dataset properties are resolved through Metadata Services.

set gimel.catalog.provider=USER
Dataset properties are supplied by the caller, either in SQL:
sql> set dataSetProperties={
  "key.deserializer":"org.apache.kafka.common.serialization.StringDeserializer",
  "auto.offset.reset":"earliest",
  "gimel.kafka.checkpoint.zookeeper.host":"zookeeper:2181",
  "gimel.storage.type":"kafka",
  "gimel.kafka.whitelist.topics":"kafka_topic",
  "datasetName":"test_table1",
  "value.deserializer":"org.apache.kafka.common.serialization.ByteArrayDeserializer",
  "value.serializer":"org.apache.kafka.common.serialization.ByteArraySerializer",
  "gimel.kafka.checkpoint.zookeeper.path":"/pcatalog/kafka_consumer/checkpoint",
  "gimel.kafka.avro.schema.source":"CSR",
  "gimel.kafka.zookeeper.connection.timeout.ms":"10000",
  "gimel.kafka.avro.schema.source.url":"http://schema_registry:8081",
  "key.serializer":"org.apache.kafka.common.serialization.StringSerializer",
  "gimel.kafka.avro.schema.source.wrapper.key":"schema_registry_key",
  "gimel.kafka.bootstrap.servers":"localhost:9092"
}
sql> select * from pcatalog.test_table1
or through the Scala API:
spark.sql("set gimel.catalog.provider=USER");
val dataSetOptions = DataSetProperties(
  "KAFKA",
  Array(Field("payload", "string", true)),
  Array(),
  Map(
    "datasetName" -> "test_table1",
    "auto.offset.reset" -> "earliest",
    "gimel.kafka.bootstrap.servers" -> "localhost:9092",
    "gimel.kafka.avro.schema.source" -> "CSR",
    "gimel.kafka.avro.schema.source.url" -> "http://schema_registry:8081",
    "gimel.kafka.avro.schema.source.wrapper.key" -> "schema_registry_key",
    "gimel.kafka.checkpoint.zookeeper.host" -> "zookeeper:2181",
    "gimel.kafka.checkpoint.zookeeper.path" -> "/pcatalog/kafka_consumer/checkpoint",
    "gimel.kafka.whitelist.topics" -> "kafka_topic",
    "gimel.kafka.zookeeper.connection.timeout.ms" -> "10000",
    "gimel.storage.type" -> "kafka",
    "key.serializer" -> "org.apache.kafka.common.serialization.StringSerializer",
    "value.serializer" -> "org.apache.kafka.common.serialization.ByteArraySerializer"
  )
)
dataSet.read("test_table1", Map("dataSetProperties" -> dataSetOptions))

set gimel.catalog.provider=HIVE
Dataset properties are stored as Hive table properties:
CREATE EXTERNAL TABLE `pcatalog.test_table1`
(payload string)
LOCATION 'hdfs://tmp/'
TBLPROPERTIES (
  "datasetName" = "dummy",
  "auto.offset.reset" = "earliest",
  "gimel.kafka.bootstrap.servers" = "localhost:9092",
  "gimel.kafka.avro.schema.source" = "CSR",
  "gimel.kafka.avro.schema.source.url" = "http://schema_registry:8081",
  "gimel.kafka.avro.schema.source.wrapper.key" = "schema_registry_key",
  "gimel.kafka.checkpoint.zookeeper.host" = "zookeeper:2181",
  "gimel.kafka.checkpoint.zookeeper.path" = "/pcatalog/kafka_consumer/checkpoint",
  "gimel.kafka.whitelist.topics" = "kafka_topic",
  "gimel.kafka.zookeeper.connection.timeout.ms" = "10000",
  "gimel.storage.type" = "kafka",
  "key.serializer" = "org.apache.kafka.common.serialization.StringSerializer",
  "value.serializer" = "org.apache.kafka.common.serialization.ByteArraySerializer"
);
Spark-sql> select * from pcatalog.test_table1
Scala> dataSet.read("test_table1", Map("dataSetProperties" -> dataSetOptions))

set gimel.catalog.provider=YOUR_CATALOG
{
  // Implement this!
}
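The provider switch can be pictured as a single lookup function that returns the same kind of property map regardless of where the metadata lives. The sketch below is an assumption-laden illustration, not Gimel's code: `get_dataset_properties`, `hive_tblproperties` and `pcatalog_lookup` are hypothetical names, and the USER branch assumes the properties arrive as the JSON string shown on the slide.

```python
import json


def get_dataset_properties(name, conf, hive_tblproperties=None, pcatalog_lookup=None):
    """Return the property map for a dataset from the configured provider."""
    provider = conf["gimel.catalog.provider"].upper()
    if provider == "USER":
        # Caller supplied the properties inline, as JSON.
        return json.loads(conf["dataSetProperties"])
    if provider == "HIVE":
        # Properties live in the table's TBLPROPERTIES (modelled as a dict here).
        return dict(hive_tblproperties[name])
    if provider == "PCATALOG":
        # Properties come from a metadata service (modelled as a callable here).
        return dict(pcatalog_lookup(name))
    raise ValueError("unknown catalog provider: {}".format(provider))


user_conf = {
    "gimel.catalog.provider": "USER",
    "dataSetProperties": json.dumps({
        "datasetName": "test_table1",
        "gimel.storage.type": "kafka",
        "gimel.kafka.whitelist.topics": "kafka_topic",
    }),
}
props = get_dataset_properties("test_table1", user_conf)
print(props["gimel.storage.type"])  # kafka
```

Adding a custom catalog then amounts to adding one more branch (or registering one more lookup) that returns the same map.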

Integration with ecosystems

Spark Thrift Server
org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
// result = sqlContext.sql(statement)  // original SQL execution
// Integration of Gimel in Spark
result = GimelQueryProcessor.executeBatch(statement, sqlContext.sparkSession)

Livy REPL
com/cloudera/livy/repl/SparkSqlInterpreter.scala
class SparkSqlInterpreter(conf: SparkConf) extends SparkInterpreter(conf) {
  private val SCALA_MAGIC = "%%[sS][cC][aA][lL][aA] (.*)".r
  private val PCATALOG_BATCH_MAGIC = "%%[gG][iI][mM][eE][lL](.*)".r
  private val PCATALOG_STREAM_MAGIC = "%%[gG][iI][mM][eE][lL]-[sS][tT][rR][eE][aA][mM] (.*)".r
  // ........
  // .....
    case PCATALOG_BATCH_MAGIC(gimelCode) => GimelQueryProcessor.executeBatch(gimelCode, sparkSession)
    case PCATALOG_STREAM_MAGIC(gimelCode) => GimelQueryProcessor.executeStream(gimelCode, sparkSession)
    case _ =>
  // ........
  // .....

Jupyter Notebooks
sparkmagic/sparkmagic/kernels/sparkkernel/kernel.js
define(['base/js/namespace'], function(IPython){
  var onload = function() {
    IPython.CodeCell.config_defaults.highlight_modes['magic_text/x-sql'] = {'reg':[/^%%gimel/]};
  }
  return { onload: onload }
})

Data Stores Supported
Systems
REST datasets

Acknowledgements
Gimel and PayPal Notebooks team:
Andrew Alves
Anisha Nainani
Ayushi Agarwal
Baskaran Gopalan
Dheeraj Rampally
Deepak Chandramouli
Laxmikant Patil
Meisam Fathi Salmi
Prabhu Kasinathan
Praveen Kanamarlapudi
Romit Mehta
Thilak Balasubramanian
Weijun Qian

Appendix

Spark Thrift Server - Integration
spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala
// result = sqlContext.sql(statement)  // original SQL execution
// Integration of Gimel in Spark
result = GimelQueryProcessor.executeBatch(statement, sqlContext.sparkSession)

Livy - Integration
/repl/src/main/scala/com/cloudera/livy/repl/SparkSqlInterpreter.scala
class SparkSqlInterpreter(conf: SparkConf) extends SparkInterpreter(conf) {
  private val SCALA_MAGIC = "%%[sS][cC][aA][lL][aA] (.*)".r
  private val PCATALOG_BATCH_MAGIC = "%%[gG][iI][mM][eE][lL](.*)".r
  private val PCATALOG_STREAM_MAGIC = "%%[gG][iI][mM][eE][lL]-[sS][tT][rR][eE][aA][mM] (.*)".r
  // ........
  // .....
  override def execute(code: String, outputPath: String): Interpreter.ExecuteResponse = {
    require(sparkContext != null && sqlContext != null && sparkSession != null)
    code match {
      case SCALA_MAGIC(scalaCode) =>
        super.execute(scalaCode, null)
      case PCATALOG_BATCH_MAGIC(gimelCode) =>
        Try {
          GimelQueryProcessor.executeBatch(gimelCode, sparkSession)
        } match {
          case Success(x) => Interpreter.ExecuteSuccess(TEXT_PLAIN -> x)
          case _ => Interpreter.ExecuteError("Failed", " ")
        }
      case PCATALOG_STREAM_MAGIC(gimelCode) =>
        Try {
          GimelQueryProcessor.executeStream(gimelCode, sparkSession)
        } match {
          case Success(x) => Interpreter.ExecuteSuccess(TEXT_PLAIN -> x)
          case _ => Interpreter.ExecuteError("Failed", " ")
        }
      case _ =>
  // ........
  // .....

PayPal Notebooks (Jupyter) - Integration
sparkmagic/sparkmagic/livyclientlib/sqlquery.py
def _scala_pcatalog_command(self, sql_context_variable_name):
    if sql_context_variable_name == u'spark':
        command = u'val output= {{import java.io.{{ByteArrayOutputStream, StringReader}};val outCapture = new ByteArrayOutputStream;Console.withOut(outCapture){{gimel.GimelQueryProcessor.executeBatch("""{}""",sparkSession)}}}}'.format(self.query)
    else:
        command = u'val output= {{import java.io.{{ByteArrayOutputStream, StringReader}};val outCapture = new ByteArrayOutputStream;Console.withOut(outCapture){{gimel.GimelQueryProcessor.executeBatch("""{}""",{})}}}}'.format(self.query, sql_context_variable_name)
    if self.samplemethod == u'sample':
        command = u'{}.sample(false, {})'.format(command, self.samplefraction)
    if self.maxrows >= 0:
        command = u'{}.take({})'.format(command, self.maxrows)
    else:
        command = u'{}.collect'.format(command)
    return Command(u'{}.foreach(println)'.format(command + ';\noutput'))
sparkmagic/sparkmagic/kernels/sparkkernel/kernel.js
define(['base/js/namespace'], function(IPython){
  var onload = function() {
    IPython.CodeCell.config_defaults.highlight_modes['magic_text/x-sql'] = {'reg':[/^%%sql/, /^%%gimel/]};
    IPython.CodeCell.config_defaults.highlight_modes['magic_text/x-python'] = {'reg':[/^%%local/]};
  }
  return { onload: onload }
})
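To see what the notebook actually sends to Livy, the string building in `_scala_pcatalog_command` can be reproduced in isolation. The function below is a simplified, standalone sketch: `build_gimel_command` is a hypothetical name, sampling is omitted, and no sparkmagic classes are involved.

```python
def build_gimel_command(query, session_var=u"sparkSession", max_rows=10):
    """Build the Scala snippet that wraps a Gimel SQL statement."""
    # Doubled braces are literal braces in str.format templates.
    template = (
        u'val output= {{import java.io.{{ByteArrayOutputStream, StringReader}};'
        u'val outCapture = new ByteArrayOutputStream;'
        u'Console.withOut(outCapture){{'
        u'gimel.GimelQueryProcessor.executeBatch("""{}""",{})}}}}'
    )
    command = template.format(query, session_var)
    if max_rows >= 0:
        command = u'{}.take({})'.format(command, max_rows)
    else:
        command = u'{}.collect'.format(command)
    # The second statement prints the captured rows back to the notebook.
    return u'{}.foreach(println)'.format(command + u';\noutput')


generated = build_gimel_command(u"select * from pcatalog.test_table1", max_rows=5)
print(generated)
```

The generated text is two Scala statements: one that runs the query and binds the limited result to `output`, and one that prints each row.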

Connectors | High level
Storage | Version | API Implementation
Kafka | 0.10.2 | Batch & Stream Connectors: implementation from scratch
Elasticsearch | 5.4.6 | Connector: https://www.elastic.co/guide/en/elasticsearch/hadoop/5.4/spark.html ; additional implementations added in Gimel to support daily / monthly partitioned indexes in ES
Aerospike | 3.1x | Read: Aerospike Spark Connector (Aerospark) is used to read data directly into a DataFrame (https://github.com/sasha-polev/aerospark). Write: Aerospike native Java client Put API is used; for each partition of the DataFrame a client connection is established to write data from that partition to Aerospike
HBase | 1.2 | Connector: Hortonworks HBase Connector for Spark (SHC), https://github.com/hortonworks-spark/shc
Cassandra | 2.x | Connector: DataStax Connector, https://github.com/datastax/spark-cassandra-connector
Hive | 1.2 | Leverages Spark APIs under the hood
Druid | 0.82 | Connector: leverages Tranquility under the hood, https://github.com/druid-io/tranquility
Teradata / Relational | | Leverages JDBC Storage Handler; support for batch reads/loads, FAST Load & FAST Export
Alluxio | | Cross-cluster access via reads using Spark conf: spark.yarn.access.namenodes
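The Aerospike write path described in the table, one client connection per DataFrame partition, follows a common pattern that can be sketched without Spark or Aerospike. `FakeClient` and `write_partition` are hypothetical stand-ins; in a Spark job the per-partition function would typically be passed to `foreachPartition`.

```python
class FakeClient:
    """Hypothetical stand-in for a native key-value store client."""

    opened = 0  # counts how many connections were created

    def __init__(self):
        FakeClient.opened += 1
        self.store = {}
        self.closed = False

    def put(self, key, record):
        self.store[key] = record

    def close(self):
        self.closed = True


def write_partition(rows, client_factory, key_field):
    """Open one client for the whole partition, write every row, then close."""
    client = client_factory()
    try:
        count = 0
        for row in rows:
            client.put(row[key_field], row)
            count += 1
        return count
    finally:
        client.close()


# Two simulated partitions of a dataset.
partitions = [[{"id": 1}, {"id": 2}], [{"id": 3}]]
written = [write_partition(p, FakeClient, "id") for p in partitions]
print(written, FakeClient.opened)  # [2, 1] 2
```

Opening the connection once per partition rather than once per row keeps connection overhead proportional to the number of partitions.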
