Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy Wins - StampedeCon 2016

Building a Next-gen Data Platform
And Leveraging OSS Ecosystem for Easy Wins
Sean Quigley
Shutterstock

Myself
● Applying logic in varied domains
○ Physics and Economics in University
○ Quant Finance
○ Data Science in Ad Tech
○ Data Engineering
● Contact
○ squigley@shutterstock.com
○ https://github.com/seanpquig
○ https://twitter.com/s_quigls
○ https://www.linkedin.com/in/seanpquig

Shutterstock
● Global technology company
● High-quality licensed content for businesses and marketing/media agencies
● 90M+ images and 4M+ videos, plus music
● 1.4M active customers in 150 countries
● Sell 5 images per second
● Data Infrastructure
○ Multi-petabyte YARN cluster
■ Hadoop, Hive, Spark, Oozie, Flink (POC)
○ Messaging + streaming solutions powered by Kafka
■ APIs for data production and consumption

Legacy Data Pipelines
● Logs (application and Nginx)
○ Flume
● Logs (Apache)
○ MariaDB
● Behavioral events
○ ZeroMQ + custom interfaces
○ JSON (messy)
● Time-series:
○ StatsD + CollectD + Graphite/Grafana
● Hadoop, Hive, and ETL:
○ Custom, home-grown jobs
○ Manual Hive DDL

This has become a bit of a Mess

Shutterstock Data Platform (SDP)
● End-to-end service pipeline
● Logs, user behavior/actions, click streams, time-series monitoring
● Ingestion and production of data
● Consumption of feeds by variable consumers
● Streaming and Batch
● ETL to long-term storage

9 Shutterstock Data Design principles
● Extremely scalable
● Structured data
● Language-agnostic protocols for data production/consumption
● Fault-tolerant
● Automation
● Unification of disparate, fractured pipelines
● Error tracking and debugging
● Well-monitored
● Leverage latest from the OSS community and minimize DIY

Apache Kafka
● Pub-sub system modeled as a distributed commit log
● Highly scalable and performant
● Zookeeper for distributed coordination
● Topics are partitioned and replicated
● Consumers pull messages via a log offset per partition
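
A toy, in-memory model of the points above, not Kafka's client API. All class and method names are hypothetical, and the CRC32 key hash is an assumption chosen for simplicity. It shows a topic as a set of append-only partition logs, with each consumer tracking its own offset per partition:

```python
import zlib
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for a partitioned, append-only topic."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Hash the key so that equal keys always land in the same partition.
        p = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def fetch(self, partition, offset, max_messages=10):
        # Consumers pull: the log only serves a slice starting at an offset.
        return self.partitions[partition][offset:offset + max_messages]


class ToyConsumer:
    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)  # partition -> next offset to read

    def poll(self, partition):
        batch = self.topic.fetch(partition, self.offsets[partition])
        self.offsets[partition] += len(batch)
        return batch


topic = ToyTopic(num_partitions=3)
p, first_offset = topic.produce("user-1", "click")
topic.produce("user-1", "purchase")

consumer = ToyConsumer(topic)
first_poll = consumer.poll(p)   # both messages, in append order
second_poll = consumer.poll(p)  # nothing new since the last offset
```

Because the offset lives with the consumer rather than the log, a second consumer can replay the same partition from offset 0 without affecting the first.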

Apache Kafka
Like a Streaming Hadoop Cluster!
*from http://hortonworks.com/apache/kafka/

Apache Avro (Overview)
● Data serialization format
● Compact, fast, binary format
● Rich data structures
● Everything revolves around schemas

Apache Avro (Example)
{
  "type": "record",
  "name": "User",
  "fields": [
    {
      "name": "first_name",
      "type": "string"
    },
    {
      "name": "age",
      "type": ["null", "int"],
      "default": null
    }
  ]
}
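
To make "compact, binary" concrete, here is a hand-rolled sketch of Avro's raw binary encoding for a record with the User schema above. It is an illustration only: real code would use an Avro library, and the function names are hypothetical. It covers just the record body, not the Avro container file format or any framing a serializer may add.

```python
def zigzag_varint(n):
    """Encode a signed integer the way Avro encodes int and long values."""
    z = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes become small unsigned values
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(s):
    # A string is its UTF-8 byte length (as a long) followed by the bytes.
    data = s.encode("utf-8")
    return zigzag_varint(len(data)) + data


def encode_user(user):
    """Fields are written in schema order; field names never appear in the bytes."""
    out = encode_string(user["first_name"])
    if user["age"] is None:
        out += zigzag_varint(0)  # union branch 0 is "null", which has no body
    else:
        out += zigzag_varint(1)  # union branch 1 is "int"
        out += zigzag_varint(user["age"])
    return out


encoded = encode_user({"first_name": "bill", "age": 27})
```

The record fits in 7 bytes because the schema carries all the structure, which is why a reader needs the writer's schema to decode it.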

Confluent Platform
● Confluent, Inc.
● “Stream Data Platform”
● Strong influence on design of our Data Platform

Confluent Schema Registry (Overview)
● RESTful interface for storing and retrieving Avro schemas
● Provides various compatibility settings for schema evolution
● Confluent Kafka serializers
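
One rule behind backward-compatible evolution in Avro is that a field added in a new schema needs a default, so a reader using the new schema can fill it in when decoding old records. The sketch below checks only that single rule. The registry's own compatibility checks cover more cases, and the function name is hypothetical.

```python
def added_fields_without_default(old_schema, new_schema):
    """Return names of fields that are new in new_schema and have no default."""
    old_names = {field["name"] for field in old_schema["fields"]}
    return [
        field["name"]
        for field in new_schema["fields"]
        if field["name"] not in old_names and "default" not in field
    ]


v1 = {"type": "record", "name": "User",
      "fields": [{"name": "first_name", "type": "string"}]}

# Adds a nullable field with a default: old records remain readable.
v2_ok = {"type": "record", "name": "User",
         "fields": [{"name": "first_name", "type": "string"},
                    {"name": "age", "type": ["null", "int"], "default": None}]}

# Adds a required field with no default: old records cannot be read.
v2_bad = {"type": "record", "name": "User",
          "fields": [{"name": "first_name", "type": "string"},
                     {"name": "email", "type": "string"}]}

ok_problems = added_fields_without_default(v1, v2_ok)
bad_problems = added_fields_without_default(v1, v2_bad)
```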

Confluent Schema Registry (Architecture)
*from http://docs.confluent.io/2.0.0/schema-registry/docs/design.html

SDP REST API (Intro)
● Language-agnostic protocol for producing Avro events into Kafka
● But Confluent tries to solve this problem.
● Why not use their REST proxy?

SDP REST API vs. Confluent REST proxy
● Example event JSON
○ {"name": "bill", "age": 27}
● To send this to the Confluent REST proxy:
○ {"value_schema": "{\"type\": \"record\", \"name\": \"User\", \"fields\": [{\"name\": \"name\", \"type\": \"string\"}, {\"name\": \"age\", \"type\": [\"null\", \"int\"], \"default\": null}]}", "records": [{"value": {"name": "bill", "age": {"int": 27}}}]}
○ Or, with the ID of an already-registered schema: {"value_schema_id": 41, "records": [{"value": {"name": "bill", "age": {"int": 27}}}]}
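
The extra work a client takes on with this payload shape can be sketched as follows. The helper name `wrap_unions` is hypothetical, and it assumes only two-branch `["null", T]` unions with a primitive `T`. The schema is serialized to a string and embedded, and non-null union values are tagged with their branch type.

```python
import json

schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": ["null", "int"], "default": None},
    ],
}
event = {"name": "bill", "age": 27}


def wrap_unions(record, record_schema):
    """Tag non-null union values with their branch type, per Avro's JSON encoding."""
    wrapped = {}
    for field in record_schema["fields"]:
        value = record[field["name"]]
        if isinstance(field["type"], list) and value is not None:
            # Assumption: the union is ["null", T], so pick the non-null branch.
            branch = next(t for t in field["type"] if t != "null")
            value = {branch: value}
        wrapped[field["name"]] = value
    return wrapped


payload = json.dumps({
    # The schema travels as a JSON string inside JSON, so json.dumps escapes it.
    "value_schema": json.dumps(schema),
    "records": [{"value": wrap_unions(event, schema)}],
})
```

Compared with the plain event JSON, every producer has to know the schema and the union-tagging convention, which is the burden the SDP REST API is meant to remove.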

SDP REST API (Overview)
● Written in Scala
● Clients send valid JSON
● JSON -> Avro schema inference and conversion
● All schema logic is fully recursive, so it works with arbitrarily nested data.
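
A minimal sketch of recursive JSON-to-Avro schema inference. The real service is written in Scala and its rules are not shown in these slides, so the type mapping, record naming, and handling of empty or mixed arrays below are all assumptions.

```python
def infer_schema(value, name="Event"):
    """Infer an Avro schema (as a dict or type name) from a parsed JSON value."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # must come before int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        item_schemas = []
        for item in value:
            item_schema = infer_schema(item, name + "_item")
            if item_schema not in item_schemas:
                item_schemas.append(item_schema)
        if len(item_schemas) > 1:
            raise ValueError("mixed-type arrays are not handled in this sketch")
        # Assumption: an empty array gets "null" items until real data is seen.
        return {"type": "array", "items": item_schemas[0] if item_schemas else "null"}
    if isinstance(value, dict):
        # Recurse into each field; nested records get a name derived from the path.
        return {
            "type": "record",
            "name": name,
            "fields": [
                {"name": key, "type": infer_schema(val, name + "_" + key)}
                for key, val in value.items()
            ],
        }
    raise TypeError(f"unsupported JSON value: {value!r}")


flat = infer_schema({"name": "bill", "age": 27})
nested = infer_schema({"user": {"tags": ["a", "b"]}})
```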

SDP REST API (Features)
● Balance between ease of use and data structure
● Flexibility for evolution of schema
○ Easy to add and remove fields
○ Some type evolutions permitted
● Schema maintains a historical record
● Error tracking and debugging tools for clients
○ Message UUIDs
○ Error topics in Kafka
○ Receive Timestamps
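
The three debugging aids above could fit together roughly like this. It is a sketch, not the service's code: the topic names, envelope keys, and the `receive` function are hypothetical.

```python
import time
import uuid


def receive(event, validate, now=time.time):
    """Wrap an incoming event and choose a destination topic for it."""
    envelope = {
        "uuid": str(uuid.uuid4()),         # lets a client trace one message end to end
        "received_at": int(now() * 1000),  # receive timestamp in epoch milliseconds
        "payload": event,
    }
    try:
        validate(event)
    except ValueError as exc:
        # Bad events are kept, with the reason attached, on a separate error topic.
        envelope["error"] = str(exc)
        return "events.errors", envelope
    return "events", envelope


def require_name(event):
    if "name" not in event:
        raise ValueError("missing field: name")


good_topic, good = receive({"name": "bill"}, require_name, now=lambda: 1.5)
bad_topic, bad = receive({"age": 27}, require_name, now=lambda: 1.5)
```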

SDP REST API (Performance)
● 1st design iteration
○ SLOW: ~100 msg/s per CPU
● Optimizations
○ Be LAZY
○ Directly populate Avro bytes
○ In-memory cache of schemas on API nodes
○ Specialized data structures
● Led to performance of ~2000-3000 msg/s per CPU (20-30X speedup)
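
The in-memory schema cache is the simplest of these optimizations to illustrate. In this sketch `fetch_schema_from_registry` is a hypothetical stand-in for an HTTP call to the registry; a real cache would also need a way to pick up new schema versions.

```python
from functools import lru_cache

registry_calls = []


def fetch_schema_from_registry(subject):
    # Hypothetical stand-in for a network round trip to the schema registry.
    registry_calls.append(subject)
    return '{"type": "string"}'


@lru_cache(maxsize=1024)
def get_schema(subject):
    """Return the schema for a subject, hitting the registry only on a cache miss."""
    return fetch_schema_from_registry(subject)


# 1000 messages on the same topic cost a single registry lookup.
for _ in range(1000):
    get_schema("clicks-value")

lookups = len(registry_calls)
```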

Other APIs and Tools
● Consumer Service
○ Wraps the Kafka Consumer API in the WebSocket protocol
○ Support for group IDs
■ Consumer groups
■ Scale consumption horizontally
● Carbon API
○ Graphite-format time-series data in Kafka
● Clients
○ Node + Java producer/consumer clients
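
For the Carbon API above, Graphite's plaintext format is one line per data point: a metric path, a value, and a Unix timestamp, separated by spaces. A small parser sketch follows; the `Metric` type and function name are hypothetical, and integer timestamps are assumed.

```python
from typing import NamedTuple


class Metric(NamedTuple):
    path: str
    value: float
    timestamp: int


def parse_graphite_line(line):
    """Parse one '<path> <value> <timestamp>' line into a Metric."""
    parts = line.split()
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {line!r}")
    path, value, timestamp = parts
    return Metric(path, float(value), int(timestamp))


metric = parse_graphite_line("servers.web01.cpu.load 0.75 1467331200\n")
```

A structured record like this is what would then be serialized and produced to Kafka in place of the raw line.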

Camus (Overview)
● Specialized MapReduce Job for Kafka -> Hadoop ETL
● Open source
● Configuration driven

Camus (Architecture)
*from http://docs.confluent.io/2.0.0/camus/docs/design.html

Hive ETL (Overview)
● Hive DDL and DML wrapped in Python scripts
● Scheduled via Oozie
● Schema-based approach really pays off here
○ Automated table management
○ Schema Evolution

Hive ETL (Example)
● Get latest historically compatible schema
○ schema.registry.net/subjects/topic_name-value/versions/latest
● Update avro table schema
○ ALTER TABLE topic_name
SET TBLPROPERTIES ('avro.schema.literal' = '{...}')
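
The two steps above might be wrapped in Python along these lines. This is a sketch, not the talk's actual scripts: the function names, the `http://` scheme, and the sample response values are assumptions, and the HTTP request and Hive execution are left out.

```python
import json


def latest_schema_url(registry_base, topic):
    """URL of the latest registered value schema for a topic."""
    return f"{registry_base}/subjects/{topic}-value/versions/latest"


def alter_table_statement(table, schema):
    """Build the HiveQL that points an Avro-backed table at a new schema."""
    literal = json.dumps(schema, separators=(",", ":"))
    # Escape backslashes and single quotes for a single-quoted HiveQL string.
    escaped = literal.replace("\\", "\\\\").replace("'", "\\'")
    return (
        f"ALTER TABLE {table} "
        f"SET TBLPROPERTIES ('avro.schema.literal' = '{escaped}')"
    )


url = latest_schema_url("http://schema.registry.net", "topic_name")

# Made-up sample of a registry response; the schema arrives as a JSON string.
sample_response = json.dumps({
    "subject": "topic_name-value",
    "version": 3,
    "id": 41,
    "schema": json.dumps({
        "type": "record",
        "name": "User",
        "fields": [{"name": "first_name", "type": "string"}],
    }),
})
schema = json.loads(json.loads(sample_response)["schema"])
statement = alter_table_statement("topic_name", schema)
```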

Hive ETL (What format?)
● Problem with Avro in Hive is that it is SLOW
● Let’s convert to something else!
○ Columnar (ORC, Parquet)
○ Easy OSS win!

Hive ETL (Columnar Conversion)
● Hive makes format conversion VERY EASY
● CREATE TABLE new_table STORED AS ORC
● INSERT … SELECT * FROM original_table
● We built a Python lib for wrapping this
○ Hive-format-converter
○ Supports schema evolution
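
A sketch of what wrapping the conversion in Python could look like. It is not the Hive-format-converter library itself: the function name and arguments are hypothetical, and schema evolution is not handled here.

```python
def conversion_statements(source, target, columns, fmt="ORC"):
    """Build HiveQL to copy a table into a columnar-format table.

    columns is a list of (name, hive_type) pairs describing the target table.
    """
    if fmt not in {"ORC", "PARQUET"}:
        raise ValueError(f"unsupported format: {fmt}")
    column_ddl = ", ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    create = f"CREATE TABLE IF NOT EXISTS {target} ({column_ddl}) STORED AS {fmt}"
    # Hive rewrites the rows into the target's storage format during the insert.
    insert = f"INSERT OVERWRITE TABLE {target} SELECT * FROM {source}"
    return [create, insert]


statements = conversion_statements(
    source="users_avro",
    target="users_orc",
    columns=[("first_name", "string"), ("age", "int")],
)
```

Each statement would then be handed to Hive in order, for example from a script scheduled by Oozie.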

Monitoring
● Nothing super sexy
● Codahale/Dropwizard Metrics works well on the JVM
○ Know your metric types
■ Counters, meters, timers, gauges
● New Relic
● Icinga
● StatsD, CollectD
● Grafana, Graphite
● Health checks on APIs

Lessons learned
● Need for data engineers to speak different languages
○ Networking, Infrastructure, and Ops
○ Frontend and Web
○ Backend
○ Data Scientists and Business Analysts
● Data is UBIQUITOUS

Lessons learned
[Diagram with two labeled rows: "Logs tier" and "Apps tier"]

Lessons learned
● Data quality and usability should be a priority for all
○ Need to communicate and partner with teams
■ Product
■ Engineering
● Strike balance between standards and flexibility for clients
○ Too little => sloppy, hard-to-manage data
○ Too much => slows down and annoys teams

Lessons learned
● In a perfect world, teams have perfectly defined interfaces
● Perfect worlds do not exist
● Take the time to understand other teams’ code/systems
● Leads to better solutions influenced by diverse viewpoints

Future Work and Possibilities
● Admin UI
○ Self-service topic creation
○ Finer-grained schema control
● More robust offset management in Consumer API
● Streaming as a Service

Predicting future re-design
● Momentum of Kafka Ecosystem protects us partially
● Kafka Connect looks promising!
○ Framework for copying to/from Kafka
○ Looks to solve common pain points
● NiFi promising too!
○ Could potentially replace ingestion pieces
● More continuous spectrum of structure/flexibility tradeoff
