Building a Next-gen Data Platform
And Leveraging OSS Ecosystem for Easy Wins
Sean Quigley
Shutterstock
Myself
● Applying logic in varied domains
○ Physics and Economics in University
○ Quant Finance
○ Data Science in Ad Tech
○ Data Engineering
● Contact
○ squigley@shutterstock.com
○ https://github.com/seanpquig
○ https://twitter.com/s_quigls
○ https://www.linkedin.com/in/seanpquig
Shutterstock
● Global technology company
● High-quality licensed content for businesses and marketing/media agencies
● 90M+ images and 4M+ videos, plus music
● 1.4M active customers in 150 countries
● Sell 5 images per second
● Data Infrastructure
○ Multi-petabyte YARN cluster
■ Hadoop, Hive, Spark, Oozie, Flink (POC)
○ Messaging + streaming solutions powered by Kafka
■ APIs for data production and consumption
Common Systems Lifecycle
Legacy Data Pipelines
● Logs (application and Nginx)
○ Flume
● Logs (Apache)
○ MariaDB
● Behavioral events
○ ZeroMQ + custom interfaces
○ JSON (messy)
● Time-series:
○ StatsD + CollectD + Graphite/Grafana
● Hadoop, Hive, and ETL:
○ Custom, home-grown jobs
○ Manual Hive DDL
This has become a bit of a mess
Time to build something new
Shutterstock Data Platform (SDP)
● End-to-end service pipeline
● Logs, user behavior/actions, click streams, time-series monitoring
● Ingestion and production of data
● Consumption of feeds by variable consumers
● Streaming and Batch
● ETL to long-term storage
9 Shutterstock Data Design principles
● Extremely scalable
● Structured data
● Language-agnostic protocols for data production/consumption
● Fault-tolerant
● Automation
● Unification of disparate, fractured pipelines
● Error tracking and debugging
● Well-monitored
● Leverage latest from the OSS community and minimize DIY
Apache Kafka
● Pub-sub system modeled as a distributed commit log
● Highly scalable and performant
● Zookeeper for distributed coordination
● Topics are partitioned and replicated
● Consumers pull messages via a log offset per partition
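The commit-log model above can be illustrated with a small in-memory sketch. This is not Kafka client code: `ToyTopic` and `ToyConsumer` are hypothetical names, and the CRC-based partitioner is an assumption chosen only to show that records with the same key land in the same partition and that each consumer tracks its own offset per partition.

```python
from collections import defaultdict
from zlib import crc32


class ToyTopic:
    """In-memory stand-in for a partitioned topic (illustration only, not Kafka)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Records with the same key always land in the same partition.
        partition = crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(value)
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def read(self, partition, offset, max_records=10):
        # Reads never remove data; the log is append-only.
        return self.partitions[partition][offset:offset + max_records]


class ToyConsumer:
    """Tracks its own next offset for each partition and pulls from there."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)

    def poll(self, partition, max_records=10):
        records = self.topic.read(partition, self.offsets[partition], max_records)
        self.offsets[partition] += len(records)
        return records
```

Because the offset lives with the consumer, a second consumer can replay the same partition from the beginning without affecting the first.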
Apache Kafka
Like a Streaming Hadoop Cluster!
*from http://hortonworks.com/apache/kafka/
Apache Avro (Overview)
● Data serialization format
● Compact, fast, binary format
● Rich data structures
● Everything revolves around schemas
Apache Avro (Example)
{
  "type": "record",
  "name": "User",
  "fields": [
    {
      "name": "first_name",
      "type": "string"
    },
    {
      "name": "age",
      "type": ["null", "int"],
      "default": null
    }
  ]
}
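To make the schema concrete, here is a minimal sketch of checking a record against it in plain Python. The `matches` helper is hypothetical and covers only the types that appear in the example (record, union, string, int, null); a real project would more likely rely on an Avro library.

```python
import json

# The User schema from the slide, written as valid JSON.
USER_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "first_name", "type": "string"},
    {"name": "age", "type": ["null", "int"], "default": null}
  ]
}
""")

# Only the primitive types used in the example are supported here.
PRIMITIVES = {
    "null": lambda v: v is None,
    "string": lambda v: isinstance(v, str),
    "int": lambda v: isinstance(v, int) and not isinstance(v, bool),
}


def matches(schema, value):
    """Return True if value fits the (simplified) Avro schema."""
    if isinstance(schema, list):  # union: any branch may match
        return any(matches(branch, value) for branch in schema)
    if isinstance(schema, str):  # primitive type name
        if schema not in PRIMITIVES:
            raise ValueError("unsupported type in this sketch: " + schema)
        return PRIMITIVES[schema](value)
    if schema["type"] == "record":
        if not isinstance(value, dict):
            return False
        for field in schema["fields"]:
            if field["name"] in value:
                if not matches(field["type"], value[field["name"]]):
                    return False
            elif "default" not in field:
                return False  # missing field with no default
        return True
    return matches(schema["type"], value)
```

A record may omit `age` because the field declares a default, but it may not omit `first_name`.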
Confluent Platform
● Confluent, Inc.
● “Stream Data Platform”
● Strong influence on design of our Data Platform
Confluent Schema Registry (Overview)
● RESTful interface for storing and retrieving Avro schemas
● Provides various compatibility settings for schema evolution
● Confluent Kafka serializers
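One way to picture a compatibility setting: under Avro's schema resolution rules, a reader using a newer schema can only fill in a field that old data lacks if that field declares a default. The sketch below checks just that one rule. It is a deliberate simplification and not the Schema Registry's actual check, which also considers type changes; both function names are hypothetical.

```python
def added_fields_without_default(old_schema, new_schema):
    """Names of fields that are new in new_schema and have no default."""
    old_names = {field["name"] for field in old_schema["fields"]}
    return [
        field["name"]
        for field in new_schema["fields"]
        if field["name"] not in old_names and "default" not in field
    ]


def can_read_old_data(old_schema, new_schema):
    """Simplified check: every added field must carry a default value."""
    return not added_fields_without_default(old_schema, new_schema)
```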
Confluent Schema Registry (Architecture)
*from http://docs.confluent.io/2.0.0/schema-registry/docs/design.html
SDP REST API (Intro)
● Language-agnostic protocol for producing Avro events into Kafka
● But Confluent tries to solve this problem.
● Why not use their REST proxy?
SDP REST API vs. Confluent REST proxy
● Example event JSON
○ {"name": "bill", "age": 27}
● To send this to the Confluent REST proxy:
○ {"value_schema": "{\"type\": \"record\", \"name\": \"User\", \"fields\": [{\"name\": \"name\", \"type\": \"string\"}, {\"name\": \"age\", \"type\": [\"null\", \"int\"], \"default\": null}]}", "records": [{"value": {"name": "bill", "age": {"int": 27}}}]}
○ {"value_schema_id": 41, "records": [{"value": {"name": "bill", "age": {"int": 27}}}]}
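The extra work a client has to do for the REST proxy can be sketched as follows: the schema is embedded as an escaped JSON string, and non-null union values are wrapped with their type name, as Avro's JSON encoding requires. The helper names `avro_json` and `build_payload` are hypothetical, and the union handling assumes only `["null", <primitive>]` unions like the one in the example.

```python
import json

# Schema and event taken from the slide; everything else here is illustrative.
USER_SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": ["null", "int"], "default": None},
    ],
}


def avro_json(schema, value):
    """Re-encode a plain JSON value using Avro's JSON encoding (simplified)."""
    if isinstance(schema, list):  # union: wrap non-null values with the type name
        if value is None:
            return None
        branch = next(b for b in schema if b != "null")
        return {branch: avro_json(branch, value)}
    if isinstance(schema, dict) and schema.get("type") == "record":
        return {
            f["name"]: avro_json(f["type"], value.get(f["name"], f.get("default")))
            for f in schema["fields"]
        }
    return value  # primitives pass through unchanged


def build_payload(schema, event):
    """Build a request body that embeds the schema as a JSON string."""
    return {
        "value_schema": json.dumps(schema),
        "records": [{"value": avro_json(schema, event)}],
    }
```

Calling `json.dumps(build_payload(USER_SCHEMA, {"name": "bill", "age": 27}))` yields a body with the same shape as the first request above, which is noticeably heavier than the two-field event the client started with.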
SDP REST API (Overview)
● Written in Scala
● Clients send valid JSON
● JSON -> Avro schema inference and conversion
● All schema logic is fully recursive, so it works with arbitrarily nested data.
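The idea of recursive JSON -> Avro schema inference can be sketched in a few lines of Python. This is not the SDP implementation (which the slides say is written in Scala); the function name, the record-naming scheme, and the choice of `long`/`double` for numbers are all assumptions made for illustration, and field-name validation is omitted.

```python
def infer_schema(value, name="Event"):
    """Infer a simplified Avro schema from a parsed JSON value, recursively."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # must be tested before int: bool is an int subclass
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        item_types = []
        for item in value:
            item_schema = infer_schema(item, name + "_item")
            if item_schema not in item_types:
                item_types.append(item_schema)
        if not item_types:
            items = "null"  # empty array: nothing to infer from
        elif len(item_types) == 1:
            items = item_types[0]
        else:
            items = item_types  # mixed element types become a union
        return {"type": "array", "items": items}
    if isinstance(value, dict):
        return {
            "type": "record",
            "name": name,
            "fields": [
                {"name": key, "type": infer_schema(val, name + "_" + key)}
                for key, val in value.items()
            ],
        }
    raise TypeError("not a JSON value: {!r}".format(value))
```

Nested objects become nested records and arrays recurse into their elements, so no special casing is needed for depth.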
SDP REST API (Features)
● Balance between ease of use and data structure
● Flexibility for evolution of schema
○ Easy to add and remove fields
○ Some type evolutions permitted
● Schema maintains a historical record
● Error tracking and debugging tools for clients
○ Message UUIDs
○ Error topics in Kafka
○ Receive Timestamps
SDP REST API (Performance)
● 1st design iteration
○ SLOW: ~100 msg/s per CPU
● Optimizations
○ Be LAZY
○ Directly populate Avro bytes
○ In-memory cache of schemas on API nodes
○ Specialized data structures
● Led to performance of ~2000-3000 msg/s per CPU (20-30x speedup)
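As an illustration of the in-memory schema cache, the sketch below memoizes schema lookups so that only the first request for a given ID pays for a registry round trip. `fetch_schema_from_registry` is a hypothetical stand-in for that network call, and the cache size is an arbitrary assumption.

```python
from functools import lru_cache

# Records every simulated registry round trip so the effect of caching is visible.
fetch_calls = []


def fetch_schema_from_registry(schema_id):
    """Hypothetical stand-in for a network call to a schema registry."""
    fetch_calls.append(schema_id)
    return '{"type": "string"}'


@lru_cache(maxsize=1024)
def get_schema(schema_id):
    """Return the schema for schema_id, hitting the registry at most once per ID."""
    return fetch_schema_from_registry(schema_id)
```

This kind of cache is safe when a registered schema ID always maps to the same schema, which is the assumption made here.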
Other APIs and Tools
● Consumer Service
○ Wraps the Kafka Consumer API in WebSocket protocol
○ Support for group IDs
■ Consumer groups
■ Scale consumption horizontally
● Carbon API
○ Graphite-format time-series data in Kafka
● Clients
○ Node + Java producer/consumer clients
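Horizontal scaling with consumer groups comes down to dividing a topic's partitions among the members of a group. The sketch below uses a simple round-robin split to show the idea; Kafka's own assignment strategies differ in detail, and `assign_partitions` is a hypothetical helper, not part of any Kafka client.

```python
def assign_partitions(partitions, members):
    """Spread partitions across group members round-robin (illustrative only)."""
    members = sorted(members)
    if not members:
        raise ValueError("a consumer group needs at least one member")
    assignment = {member: [] for member in members}
    for index, partition in enumerate(sorted(partitions)):
        assignment[members[index % len(members)]].append(partition)
    return assignment
```

Adding members reduces each one's share, but members beyond the partition count receive nothing, so the number of partitions caps the useful size of a group.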
Camus (Overview)
● Specialized MapReduce Job for Kafka -> Hadoop ETL
● Open source
● Configuration driven
Camus (Architecture)
*from http://docs.confluent.io/2.0.0/camus/docs/design.html
Hive ETL (Overview)
● Hive DDL and DML wrapped in Python scripts
● Scheduled via Oozie
● Schema-based approach really pays off here
○ Automated table management
○ Schema Evolution
Hive ETL (Example)
● Get latest historically compatible schema
○ schema.registry.net/subjects/topic_name-value/versions/latest
● Update avro table schema
○ ALTER TABLE topic_name
SET TBLPROPERTIES ('avro.schema.literal' = '{...}')
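The two steps above could be wrapped in Python roughly as follows. This sketch only builds the registry URL and the HiveQL text; fetching the schema and executing the statement are left out. The function names are hypothetical, and the string escaping assumes Hive's backslash-style escapes inside single-quoted literals.

```python
import json


def latest_schema_url(registry_base_url, topic):
    """URL of the latest value schema registered for a topic."""
    return "{}/subjects/{}-value/versions/latest".format(
        registry_base_url.rstrip("/"), topic
    )


def alter_table_statement(table, avro_schema):
    """HiveQL that points an Avro-backed table at a new schema literal."""
    literal = json.dumps(avro_schema, separators=(",", ":"))
    # Escape backslashes first, then single quotes, for the quoted literal.
    escaped = literal.replace("\\", "\\\\").replace("'", "\\'")
    return "ALTER TABLE {} SET TBLPROPERTIES ('avro.schema.literal' = '{}')".format(
        table, escaped
    )
```

For example, `latest_schema_url("http://schema.registry.net", "topic_name")` gives the address shown on the slide, and the parsed schema from that response can be passed to `alter_table_statement`.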
Hive ETL (What format?)
● Problem with Avro in Hive is that it is SLOW
● Let’s convert to something else!
○ Columnar (ORC, Parquet)
○ Easy OSS win!
ORC Format
Hive ETL (Columnar Conversion)
● Hive makes format conversion VERY EASY
● CREATE TABLE new_table STORED AS ORC
● INSERT … SELECT * FROM original_table
● We built a Python lib for wrapping this
○ Hive-format-converter
○ Supports schema evolution
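A wrapper for the conversion might generate the two statements like this. It is a sketch under stated assumptions, not the hive-format-converter library itself: the function name, the table names, and the explicit `(name, type)` column list are all hypothetical, and schema evolution is not handled.

```python
def orc_conversion_statements(source_table, target_table, columns):
    """Build HiveQL to copy a table into an ORC-backed table.

    columns is a list of (name, hive_type) pairs describing the source table.
    """
    column_defs = ", ".join("`{}` {}".format(name, hive_type) for name, hive_type in columns)
    select_list = ", ".join("`{}`".format(name) for name, _ in columns)
    create = "CREATE TABLE IF NOT EXISTS {} ({}) STORED AS ORC".format(
        target_table, column_defs
    )
    insert = "INSERT OVERWRITE TABLE {} SELECT {} FROM {}".format(
        target_table, select_list, source_table
    )
    return [create, insert]
```

Listing the columns explicitly, rather than using `SELECT *`, keeps the column order of the ORC table under the wrapper's control.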
Monitoring
● Nothing super sexy
● Codahale/Dropwizard Metrics works great on the JVM
○ Know your metric types
■ Counters, meters, timers, gauges
● New Relic
● Icinga
● StatsD, CollectD
● Grafana, Graphite
● Health checks on APIs
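For readers unfamiliar with the four metric types, the sketch below shows what each one measures. These are toy Python classes written for illustration, not the Dropwizard Metrics API, and the meter reports only a lifetime mean rate rather than moving averages.

```python
import time


class Counter:
    """A running count, e.g. messages received."""

    def __init__(self):
        self.count = 0

    def inc(self, n=1):
        self.count += n


class Gauge:
    """An instantaneous value read on demand, e.g. current queue depth."""

    def __init__(self, read_fn):
        self.read_fn = read_fn

    def value(self):
        return self.read_fn()


class Meter:
    """Rate of events per second over the meter's lifetime."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.start = clock()
        self.count = 0

    def mark(self, n=1):
        self.count += n

    def mean_rate(self):
        elapsed = self.clock() - self.start
        return self.count / elapsed if elapsed > 0 else 0.0


class Timer:
    """Records how long each timed block takes, in seconds."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.durations = []

    def __enter__(self):
        self._started = self.clock()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.durations.append(self.clock() - self._started)
        return False  # never swallow exceptions from the timed block
```

A request handler would typically increment a counter and mark a meter per message, and wrap the handling code in `with timer:` to collect latencies.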
Lessons learned
● Need for data engineers to speak different languages
○ Networking, Infrastructure, and Ops
○ Frontend and Web
○ Backend
○ Data Scientists and Business Analysts
● Data is UBIQUITOUS
Lessons learned
Logs tier →
Apps tier →
Lessons learned
● Data quality and usability should be a priority for all
○ Need to communicate and partner with teams
■ Product
■ Engineering
● Strike balance between standards and flexibility for clients
○ Too little => sloppy, hard-to-manage data
○ Too much => slows down and annoys teams
Lessons learned
● In a perfect world, teams have perfectly defined interfaces
● Perfect worlds do not exist
● Take the time to understand other teams’ code/systems
● Leads to better solutions influenced by diverse viewpoints
Future Work and Possibilities
● Admin UI
○ Self-service topic creation
○ Finer-grained schema control
● More robust offset management in Consumer API
● Streaming as a Service
Remember This?
Predicting future re-design
● Momentum of Kafka Ecosystem protects us partially
● Kafka Connect looks promising!
○ Framework for copying to/from Kafka
○ Looks to solve common pain points
● NiFi promising too!
○ Could potentially replace ingestion pieces
● More continuous spectrum of structure/flexibility tradeoff
Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy Wins - StampedeCon 2016

More Related Content

What's hot

Building a Data Pipeline With Tools From the Hadoop Ecosystem - StampedeCon 2016
Building a Data Pipeline With Tools From the Hadoop Ecosystem - StampedeCon 2016Building a Data Pipeline With Tools From the Hadoop Ecosystem - StampedeCon 2016
Building a Data Pipeline With Tools From the Hadoop Ecosystem - StampedeCon 2016
StampedeCon
 
Big Data Architecture and Design Patterns
Big Data Architecture and Design PatternsBig Data Architecture and Design Patterns
Big Data Architecture and Design Patterns
John Yeung
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
Mark Kromer
 
Hadoop for the Masses
Hadoop for the MassesHadoop for the Masses
Hadoop for the Masses
DataWorks Summit/Hadoop Summit
 
Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...
Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...
Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...
Mark Rittman
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
Rittman Analytics
 
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
StampedeCon
 
Pentaho Big Data Analytics with Vertica and Hadoop
Pentaho Big Data Analytics with Vertica and HadoopPentaho Big Data Analytics with Vertica and Hadoop
Pentaho Big Data Analytics with Vertica and Hadoop
Mark Kromer
 
Unlock the value in your big data reservoir using oracle big data discovery a...
Unlock the value in your big data reservoir using oracle big data discovery a...Unlock the value in your big data reservoir using oracle big data discovery a...
Unlock the value in your big data reservoir using oracle big data discovery a...
Mark Rittman
 
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
Mark Rittman
 
How a Tweet Went Viral - BIWA Summit 2017
How a Tweet Went Viral - BIWA Summit 2017How a Tweet Went Viral - BIWA Summit 2017
How a Tweet Went Viral - BIWA Summit 2017
Rittman Analytics
 
Pentaho Analytics on MongoDB
Pentaho Analytics on MongoDBPentaho Analytics on MongoDB
Pentaho Analytics on MongoDB
Mark Kromer
 
Using Oracle Big Data Discovey as a Data Scientist's Toolkit
Using Oracle Big Data Discovey as a Data Scientist's ToolkitUsing Oracle Big Data Discovey as a Data Scientist's Toolkit
Using Oracle Big Data Discovey as a Data Scientist's Toolkit
Mark Rittman
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoMark Kromer
 
ODI12c as your Big Data Integration Hub
ODI12c as your Big Data Integration HubODI12c as your Big Data Integration Hub
ODI12c as your Big Data Integration Hub
Mark Rittman
 
What is Big Data Discovery, and how it complements traditional business anal...
What is Big Data Discovery, and how it complements  traditional business anal...What is Big Data Discovery, and how it complements  traditional business anal...
What is Big Data Discovery, and how it complements traditional business anal...
Mark Rittman
 
Big Data & Data Lakes Building Blocks
Big Data & Data Lakes Building BlocksBig Data & Data Lakes Building Blocks
Big Data & Data Lakes Building Blocks
Amazon Web Services
 
Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...
Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...
Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...
Mark Rittman
 
Hadoop Journey at Walgreens
Hadoop Journey at WalgreensHadoop Journey at Walgreens
Hadoop Journey at Walgreens
DataWorks Summit
 
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...
Mark Rittman
 

What's hot (20)

Building a Data Pipeline With Tools From the Hadoop Ecosystem - StampedeCon 2016
Building a Data Pipeline With Tools From the Hadoop Ecosystem - StampedeCon 2016Building a Data Pipeline With Tools From the Hadoop Ecosystem - StampedeCon 2016
Building a Data Pipeline With Tools From the Hadoop Ecosystem - StampedeCon 2016
 
Big Data Architecture and Design Patterns
Big Data Architecture and Design PatternsBig Data Architecture and Design Patterns
Big Data Architecture and Design Patterns
 
Big Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL ServerBig Data Analytics with Hadoop, MongoDB and SQL Server
Big Data Analytics with Hadoop, MongoDB and SQL Server
 
Hadoop for the Masses
Hadoop for the MassesHadoop for the Masses
Hadoop for the Masses
 
Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...
Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...
Using Oracle Big Data SQL 3.0 to add Hadoop & NoSQL to your Oracle Data Wareh...
 
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
New World Hadoop Architectures (& What Problems They Really Solve) for Oracle...
 
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
What’s New in Spark 2.0: Structured Streaming and Datasets - StampedeCon 2016
 
Pentaho Big Data Analytics with Vertica and Hadoop
Pentaho Big Data Analytics with Vertica and HadoopPentaho Big Data Analytics with Vertica and Hadoop
Pentaho Big Data Analytics with Vertica and Hadoop
 
Unlock the value in your big data reservoir using oracle big data discovery a...
Unlock the value in your big data reservoir using oracle big data discovery a...Unlock the value in your big data reservoir using oracle big data discovery a...
Unlock the value in your big data reservoir using oracle big data discovery a...
 
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
 
How a Tweet Went Viral - BIWA Summit 2017
How a Tweet Went Viral - BIWA Summit 2017How a Tweet Went Viral - BIWA Summit 2017
How a Tweet Went Viral - BIWA Summit 2017
 
Pentaho Analytics on MongoDB
Pentaho Analytics on MongoDBPentaho Analytics on MongoDB
Pentaho Analytics on MongoDB
 
Using Oracle Big Data Discovey as a Data Scientist's Toolkit
Using Oracle Big Data Discovey as a Data Scientist's ToolkitUsing Oracle Big Data Discovey as a Data Scientist's Toolkit
Using Oracle Big Data Discovey as a Data Scientist's Toolkit
 
Big Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with PentahoBig Data Analytics Projects - Real World with Pentaho
Big Data Analytics Projects - Real World with Pentaho
 
ODI12c as your Big Data Integration Hub
ODI12c as your Big Data Integration HubODI12c as your Big Data Integration Hub
ODI12c as your Big Data Integration Hub
 
What is Big Data Discovery, and how it complements traditional business anal...
What is Big Data Discovery, and how it complements  traditional business anal...What is Big Data Discovery, and how it complements  traditional business anal...
What is Big Data Discovery, and how it complements traditional business anal...
 
Big Data & Data Lakes Building Blocks
Big Data & Data Lakes Building BlocksBig Data & Data Lakes Building Blocks
Big Data & Data Lakes Building Blocks
 
Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...
Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...
Social Network Analysis using Oracle Big Data Spatial & Graph (incl. why I di...
 
Hadoop Journey at Walgreens
Hadoop Journey at WalgreensHadoop Journey at Walgreens
Hadoop Journey at Walgreens
 
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...
End to-end hadoop development using OBIEE, ODI, Oracle Big Data SQL and Oracl...
 

Viewers also liked

Searching Images by Color Using Solr
Searching Images by Color Using SolrSearching Images by Color Using Solr
Searching Images by Color Using Solr
Chris Becker
 
Searching Images by Color: Presented by Chris Becker, Shutterstock
Searching Images by Color: Presented by Chris Becker, ShutterstockSearching Images by Color: Presented by Chris Becker, Shutterstock
Searching Images by Color: Presented by Chris Becker, Shutterstock
Lucidworks
 
Batch and Real-time EHR updates into Hadoop - StampedeCon 2015
Batch and Real-time EHR updates into Hadoop - StampedeCon 2015Batch and Real-time EHR updates into Hadoop - StampedeCon 2015
Batch and Real-time EHR updates into Hadoop - StampedeCon 2015
StampedeCon
 
Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta...
Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta...Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta...
Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta...
StampedeCon
 
Hadoop Security and Compliance - StampedeCon 2016
Hadoop Security and Compliance - StampedeCon 2016Hadoop Security and Compliance - StampedeCon 2016
Hadoop Security and Compliance - StampedeCon 2016
StampedeCon
 
Avro intro
Avro introAvro intro
Avro intro
Randy Abernethy
 
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
StampedeCon
 
Visualizing Big Data – The Fundamentals
Visualizing Big Data – The FundamentalsVisualizing Big Data – The Fundamentals
Visualizing Big Data – The Fundamentals
StampedeCon
 
Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016
StampedeCon
 
Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416
Chicago Hadoop Users Group
 
Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016
StampedeCon
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Igor Anishchenko
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
Hortonworks
 
อุปกรณ์พื้นฐานคอมพิวเตอร์
อุปกรณ์พื้นฐานคอมพิวเตอร์อุปกรณ์พื้นฐานคอมพิวเตอร์
อุปกรณ์พื้นฐานคอมพิวเตอร์
I'Tay Tanawin
 
China cardiovascular system drugs industry market demand forecast and investm...
China cardiovascular system drugs industry market demand forecast and investm...China cardiovascular system drugs industry market demand forecast and investm...
China cardiovascular system drugs industry market demand forecast and investm...Qianzhan Intelligence
 
China clothing industry production & marketing demand and development forecas...
China clothing industry production & marketing demand and development forecas...China clothing industry production & marketing demand and development forecas...
China clothing industry production & marketing demand and development forecas...Qianzhan Intelligence
 
제주도에어카텔 하이난할인항공권
제주도에어카텔 하이난할인항공권제주도에어카텔 하이난할인항공권
제주도에어카텔 하이난할인항공권
hwseywe
 
Začněte testovat na dálku. Levnější už to nebude. - Petr Štědrý
Začněte testovat na dálku. Levnější už to nebude. - Petr ŠtědrýZačněte testovat na dálku. Levnější už to nebude. - Petr Štědrý
Začněte testovat na dálku. Levnější už to nebude. - Petr Štědrý
Akce Dobrého webu
 

Viewers also liked (20)

Searching Images by Color Using Solr
Searching Images by Color Using SolrSearching Images by Color Using Solr
Searching Images by Color Using Solr
 
Searching Images by Color: Presented by Chris Becker, Shutterstock
Searching Images by Color: Presented by Chris Becker, ShutterstockSearching Images by Color: Presented by Chris Becker, Shutterstock
Searching Images by Color: Presented by Chris Becker, Shutterstock
 
Batch and Real-time EHR updates into Hadoop - StampedeCon 2015
Batch and Real-time EHR updates into Hadoop - StampedeCon 2015Batch and Real-time EHR updates into Hadoop - StampedeCon 2015
Batch and Real-time EHR updates into Hadoop - StampedeCon 2015
 
Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta...
Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta...Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta...
Enterprise Search: Addressing the First Problem of Big Data & Analytics - Sta...
 
Hadoop Security and Compliance - StampedeCon 2016
Hadoop Security and Compliance - StampedeCon 2016Hadoop Security and Compliance - StampedeCon 2016
Hadoop Security and Compliance - StampedeCon 2016
 
Avro intro
Avro introAvro intro
Avro intro
 
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
The Big Data Journey – How Companies Adopt Hadoop - StampedeCon 2016
 
Avro
AvroAvro
Avro
 
Visualizing Big Data – The Fundamentals
Visualizing Big Data – The FundamentalsVisualizing Big Data – The Fundamentals
Visualizing Big Data – The Fundamentals
 
Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016
 
Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416Avro - More Than Just a Serialization Framework - CHUG - 20120416
Avro - More Than Just a Serialization Framework - CHUG - 20120416
 
Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016
 
Thrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased ComparisonThrift vs Protocol Buffers vs Avro - Biased Comparison
Thrift vs Protocol Buffers vs Avro - Biased Comparison
 
SQL to Hive Cheat Sheet
SQL to Hive Cheat SheetSQL to Hive Cheat Sheet
SQL to Hive Cheat Sheet
 
-11031502
-11031502-11031502
-11031502
 
อุปกรณ์พื้นฐานคอมพิวเตอร์
อุปกรณ์พื้นฐานคอมพิวเตอร์อุปกรณ์พื้นฐานคอมพิวเตอร์
อุปกรณ์พื้นฐานคอมพิวเตอร์
 
China cardiovascular system drugs industry market demand forecast and investm...
China cardiovascular system drugs industry market demand forecast and investm...China cardiovascular system drugs industry market demand forecast and investm...
China cardiovascular system drugs industry market demand forecast and investm...
 
China clothing industry production & marketing demand and development forecas...
China clothing industry production & marketing demand and development forecas...China clothing industry production & marketing demand and development forecas...
China clothing industry production & marketing demand and development forecas...
 
제주도에어카텔 하이난할인항공권
제주도에어카텔 하이난할인항공권제주도에어카텔 하이난할인항공권
제주도에어카텔 하이난할인항공권
 
Začněte testovat na dálku. Levnější už to nebude. - Petr Štědrý
Začněte testovat na dálku. Levnější už to nebude. - Petr ŠtědrýZačněte testovat na dálku. Levnější už to nebude. - Petr Štědrý
Začněte testovat na dálku. Levnější už to nebude. - Petr Štědrý
 

Similar to Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy Wins - StampedeCon 2016

Machine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systemsMachine learning and big data @ uber a tale of two systems
Machine learning and big data @ uber a tale of two systems
Zhenxiao Luo
 
Testing data streaming applications
Testing data streaming applicationsTesting data streaming applications
Testing data streaming applications
Lars Albertsson
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
Omid Vahdaty
 
Red hat infrastructure for analytics
Red hat infrastructure for analyticsRed hat infrastructure for analytics
Red hat infrastructure for analytics
Kyle Bader
 
Change data capture
Change data captureChange data capture
Change data capture
Ron Barabash
 
Summer 2017 undergraduate research powerpoint
Summer 2017 undergraduate research powerpointSummer 2017 undergraduate research powerpoint
Summer 2017 undergraduate research powerpoint
Christopher Dubois
 
From leading IoT Protocols to Python Dashboarding_final
From leading IoT Protocols to Python Dashboarding_finalFrom leading IoT Protocols to Python Dashboarding_final
From leading IoT Protocols to Python Dashboarding_final
Lukas Ott
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
Wes McKinney
 
Data streaming
Data streamingData streaming
Data streaming
Alberto Paro
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
Wes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
Wes McKinney
 
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | EnglishAWS big-data-demystified #1.1  | Big Data Architecture Lessons Learned | English
AWS big-data-demystified #1.1 | Big Data Architecture Lessons Learned | English
Omid Vahdaty
 
Node.js Web Apps @ ebay scale
Node.js Web Apps @ ebay scaleNode.js Web Apps @ ebay scale
Node.js Web Apps @ ebay scale
Dmytro Semenov
 
Introduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQLIntroduction to Structured Data Processing with Spark SQL
Introduction to Structured Data Processing with Spark SQL
datamantra
 
Cloud Native API Design and Management
Cloud Native API Design and ManagementCloud Native API Design and Management
Cloud Native API Design and Management
AllBits BVBA (freelancer)
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
Marcin Bielak
 
Build real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache KafkaBuild real time stream processing applications using Apache Kafka
Build real time stream processing applications using Apache Kafka
Hotstar
 
GraphQL is actually rest
GraphQL is actually restGraphQL is actually rest
GraphQL is actually rest
Jakub Riedl
 
Streamsets and spark in Retail
Streamsets and spark in RetailStreamsets and spark in Retail
Streamsets and spark in Retail
Hari Shreedharan
 
Analytic Insights in Retail Using Apache Spark with Hari Shreedharan
Analytic Insights in Retail Using Apache Spark with Hari ShreedharanAnalytic Insights in Retail Using Apache Spark with Hari Shreedharan
Analytic Insights in Retail Using Apache Spark with Hari Shreedharan
Databricks
 

Building a Next-gen Data Platform and Leveraging the OSS Ecosystem for Easy Wins - StampedeCon 2016

  • 1. Building a Next-gen Data Platform And Leveraging OSS Ecosystem for Easy Wins Sean Quigley Shutterstock
  • 2. Myself
    ● Applying logic in varied domains
      ○ Physics and Economics in University
      ○ Quant Finance
      ○ Data Science in Ad Tech
      ○ Data Engineering
    ● Contact
      ○ squigley@shutterstock.com
      ○ https://github.com/seanpquig
      ○ https://twitter.com/s_quigls
      ○ https://www.linkedin.com/in/seanpquig
  • 3. Shutterstock
    ● Global technology company
    ● High-quality licensed content for businesses, marketing/media agencies
    ● 90M+ images, and 4M+ videos, music too
    ● 1.4M active customers in 150 countries
    ● Sell 5 images per second
    ● Data Infrastructure
      ○ Multi-petabyte YARN cluster
        ■ Hadoop, Hive, Spark, Oozie, Flink (POC)
      ○ Messaging + streaming solutions powered by Kafka
        ■ APIs for data production and consumption
  • 5. Legacy Data Pipelines
    ● Logs (application and Nginx)
      ○ Flume
    ● Logs (Apache)
      ○ MariaDB
    ● Behavioral events
      ○ ZeroMQ + custom interfaces
      ○ JSON (messy)
    ● Time-series:
      ○ StatsD + CollectD + Graphite/Grafana
    ● Hadoop, Hive, and ETL:
      ○ Custom, home-grown jobs
      ○ Manual Hive DDL
  • 6. This has become a bit of a Mess
  • 8. Shutterstock Data Platform (SDP)
    ● End-to-end service pipeline
    ● Logs, user behavior/actions, click streams, time-series monitoring
    ● Ingestion and production of data
    ● Consumption of feeds by variable consumers
    ● Streaming and Batch
    ● ETL to long-term storage
  • 9. 9 Shutterstock Data Design principles
    ● Extremely scalable
    ● Structured data
    ● Language-agnostic protocols for data production/consumption
    ● Fault-tolerant
    ● Automation
    ● Unification of disparate, fractured pipelines
    ● Error tracking and debugging
    ● Well-monitored
    ● Leverage latest from the OSS community and minimize DIY
  • 11. Apache Kafka
    ● Pub-sub system modeled as a distributed commit log
    ● Highly scalable and performant
    ● ZooKeeper for distributed coordination
    ● Topics are partitioned and replicated
    ● Consumers pull messages via a log offset per partition
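The pull-by-offset model on this slide can be illustrated with a toy, in-memory sketch. This is not Kafka client code: `PartitionedLog` and `PullConsumer` are hypothetical names, and `crc32` is only a stand-in for a real partitioner. It shows a topic split into append-only partitions and a consumer that tracks its own offset per partition.

```python
from collections import defaultdict
from zlib import crc32


class PartitionedLog:
    """Toy stand-in for one topic: a list of append-only partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # A stable hash of the key picks the partition, so one key
        # always lands in the same partition and keeps its ordering.
        p = crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset, max_messages):
        return self.partitions[partition][offset:offset + max_messages]


class PullConsumer:
    """Consumer that pulls batches and remembers one offset per partition."""

    def __init__(self, log):
        self.log = log
        self.offsets = defaultdict(int)

    def poll(self, partition, max_messages=10):
        batch = self.log.read(partition, self.offsets[partition], max_messages)
        self.offsets[partition] += len(batch)  # advance only past what was read
        return batch


log = PartitionedLog(num_partitions=3)
for i in range(5):
    log.append("user-1", f"event-{i}")
p, last_offset = log.append("user-1", "event-5")

consumer = PullConsumer(log)
first = consumer.poll(p, max_messages=4)
second = consumer.poll(p, max_messages=4)
```

Because the offset lives with the consumer rather than the log, a consumer can rewind or replay a partition simply by resetting its stored offset.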
  • 12. Apache Kafka Like a Streaming Hadoop Cluster! *from http://hortonworks.com/apache/kafka/
  • 14. Apache Avro (Overview)
    ● Data serialization format
    ● Compact, fast, binary format
    ● Rich data structures
    ● Everything revolves around schemas
  • 15. Apache Avro (Example)
    {
      "type": "record",
      "name": "User",
      "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "age", "type": ["null", "int"], "default": null}
      ]
    }
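To make the role of the `["null", "int"]` union and its `null` default concrete, here is a minimal sketch using only the Python standard library. It is not an Avro implementation: `apply_defaults`, `matches`, and `validate` are hypothetical helpers that handle just the three primitive types in this schema, and filling defaults this way only loosely mirrors what an Avro reader does.

```python
import json

USER_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "first_name", "type": "string"},
    {"name": "age", "type": ["null", "int"], "default": null}
  ]
}
""")

# Only the primitives used by the example schema are covered here.
PRIMITIVE_CHECKS = {
    "null": lambda v: v is None,
    "string": lambda v: isinstance(v, str),
    "int": lambda v: isinstance(v, int) and not isinstance(v, bool),
}


def apply_defaults(schema, record):
    """Fill missing fields from their declared defaults."""
    out = {}
    for field in schema["fields"]:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise ValueError(f"missing required field: {name}")
    return out


def matches(avro_type, value):
    """A union (JSON list) matches if any of its branches matches."""
    if isinstance(avro_type, list):
        return any(matches(branch, value) for branch in avro_type)
    return PRIMITIVE_CHECKS[avro_type](value)


def validate(schema, record):
    full = apply_defaults(schema, record)
    for field in schema["fields"]:
        if not matches(field["type"], full[field["name"]]):
            raise ValueError(f"bad value for field: {field['name']}")
    return full


without_age = validate(USER_SCHEMA, {"first_name": "bill"})
with_age = validate(USER_SCHEMA, {"first_name": "bill", "age": 27})
```

The union with `null` is what makes `age` optional: a record without it still validates, while `first_name` has no default and must always be present.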
  • 16. Confluent Platform
    ● Confluent, Inc.
    ● “Stream Data Platform”
    ● Strong influence on design of our Data Platform
  • 17. Confluent Schema Registry (Overview)
    ● RESTful interface for storing and retrieving Avro schemas
    ● Provides various compatibility settings for schema evolution
    ● Confluent Kafka serializers
  • 18. Confluent Schema Registry (Architecture) *from http://docs.confluent.io/2.0.0/schema-registry/docs/design.html
  • 20. SDP REST API (Intro)
    ● Language-agnostic protocol for producing Avro events into Kafka
    ● But Confluent tries to solve this problem.
    ● Why not use their REST proxy?
  • 21. SDP REST API vs. Confluent REST proxy
    ● Example event JSON
      ○ {"name": "bill", "age": 27}
    ● To send this to the Confluent REST proxy:
      ○ {"value_schema": "{\"type\": \"record\", \"name\": \"User\", \"fields\": [{\"name\": \"name\", \"type\": \"string\"}, {\"name\": \"age\", \"type\": [\"null\", \"int\"], \"default\": null}]}", "records": [{"value": {"name": "bill", "age": {"int": 27}}}]}
      ○ {"value_schema_id": 41, "records": [{"value": {"name": "bill", "age": {"int": 27}}}]}
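The extra work a client takes on with the REST proxy can be sketched in a few lines of Python. This only builds the request bodies and sends nothing. `to_avro_json` is a hypothetical helper that assumes every union is `null` plus one primitive, and the `value_schema_id` field name reflects my reading of the Confluent REST Proxy API, so check it against the documentation for your version.

```python
import json

schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": ["null", "int"], "default": None},
    ],
}
event = {"name": "bill", "age": 27}


def to_avro_json(schema, record):
    """Wrap non-null union values as {branch: value}, per Avro's JSON encoding.

    Simplification: assumes each union is "null" plus exactly one primitive.
    """
    out = {}
    for field in schema["fields"]:
        name, avro_type = field["name"], field["type"]
        value = record.get(name)
        if isinstance(avro_type, list) and value is not None:
            branch = next(t for t in avro_type if t != "null")
            out[name] = {branch: value}
        else:
            out[name] = value
    return out


records = [{"value": to_avro_json(schema, event)}]

# Variant 1: the schema travels with every request as an escaped JSON string.
with_inline_schema = json.dumps(
    {"value_schema": json.dumps(schema), "records": records}
)

# Variant 2: the client references a schema ID it looked up earlier.
with_schema_id = json.dumps({"value_schema_id": 41, "records": records})
```

Either way the client has to know the schema and wrap union values itself, which is the friction the plain `{"name": "bill", "age": 27}` event avoids.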
  • 22. SDP REST API (Overview)
    ● Written in Scala
    ● Clients send valid JSON
    ● JSON -> Avro schema inference and conversion
    ● All schema logic is fully recursive, so it works with arbitrarily nested data.
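A rough idea of what recursive JSON -> Avro schema inference looks like, sketched in Python rather than Scala. This is not the SDP implementation: `infer_avro_type`, the record naming scheme, and the choice of `long`/`double` for numbers are all assumptions made for illustration.

```python
def infer_avro_type(value, name="Record"):
    """Recursively infer an Avro type for a decoded JSON value.

    Toy rules: ints become "long", floats "double", dicts become named
    records, and lists become arrays whose item type is the union of the
    types seen. Arrays of differently shaped objects would need merged
    record schemas, which this sketch does not attempt.
    """
    if value is None:
        return "null"
    if isinstance(value, bool):  # must precede int: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        item_types = []
        for item in value:
            t = infer_avro_type(item, f"{name}_item")
            if t not in item_types:
                item_types.append(t)
        if not item_types:
            items = "null"  # nothing to infer from an empty list
        elif len(item_types) == 1:
            items = item_types[0]
        else:
            items = item_types
        return {"type": "array", "items": items}
    if isinstance(value, dict):
        return {
            "type": "record",
            "name": name,
            "fields": [
                {"name": key, "type": infer_avro_type(child, f"{name}_{key}")}
                for key, child in value.items()
            ],
        }
    raise TypeError(f"unsupported JSON value: {value!r}")


event = {
    "name": "bill",
    "age": 27,
    "address": {"zip": "10001"},
    "tags": ["a", "b"],
}
inferred = infer_avro_type(event, "User")
```

Since the function calls itself for every nested object and array, the nesting depth of the event does not matter, which is the property the slide points to.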
  • 23. SDP REST API (Features)
    ● Balance between ease of use and data structure
    ● Flexibility for evolution of schema
      ○ Easy to add and remove fields
      ○ Some type evolutions permitted
    ● Schema maintains a historical record
    ● Error tracking and debugging tools for clients
      ○ Message UUIDs
      ○ Error topics in Kafka
      ○ Receive timestamps
  • 24. SDP REST API (Performance)
    ● 1st design iteration
      ○ SLOW: ~100 msg/s per CPU
    ● Optimizations
      ○ Be LAZY
      ○ Directly populate Avro bytes
      ○ In-memory cache of schemas on API nodes
      ○ Specialized data structures
    ● Led to performance of ~2000-3000 msg/s per CPU (20-30X speedup)
  • 25. Other APIs and Tools
    ● Consumer Service
      ○ Wraps the Kafka Consumer API in WebSocket protocol
      ○ Support for group IDs
        ■ Consumer groups
        ■ Scale consumption horizontally
    ● Carbon API
      ○ Graphite-format time-series data in Kafka
    ● Clients
      ○ Node + Java producer/consumer clients
  • 27. Camus (Overview) ● Specialized MapReduce Job for Kafka -> Hadoop ETL ● Open source ● Configuration driven
  • 29. Hive ETL (Overview) ● Hive DDL and DML wrapped in Python scripts ● Scheduled via Oozie ● Schema-based approach really pays off here ○ Automated table management ○ Schema evolution
  • 30. Hive ETL (Example) ● Get latest historically compatible schema ○ schema.registry.net/subjects/topic_name-value/versions/latest ● Update avro table schema ○ ALTER TABLE topic_name SET TBLPROPERTIES ('avro.schema.literal' = '{...}')
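The two steps above can be sketched in Python as string-building helpers. The function names are hypothetical, the `http://` scheme on the registry host is an assumption, and the actual HTTP request is left out. The sketch relies on the registry's "latest version" response carrying the schema as a JSON string in its `schema` field.

```python
import json


def latest_schema_url(registry_base, topic):
    """URL of the latest schema version for a topic's value subject."""
    return f"{registry_base}/subjects/{topic}-value/versions/latest"


def alter_table_statement(table, registry_response):
    """Build the ALTER TABLE statement from a parsed registry response."""
    # Re-serialize compactly so the literal stays on one line.
    schema = json.loads(registry_response["schema"])
    literal = json.dumps(schema, separators=(",", ":"))
    # Assumption: Hive string literals use backslash escaping, so escape
    # backslashes first and then single quotes.
    literal = literal.replace("\\", "\\\\").replace("'", "\\'")
    return f"ALTER TABLE {table} SET TBLPROPERTIES ('avro.schema.literal' = '{literal}')"


# Example response body, shaped like the registry's "latest version" reply.
response = {
    "subject": "topic_name-value",
    "version": 3,
    "id": 41,
    "schema": json.dumps({"type": "record", "name": "User",
                          "fields": [{"name": "name", "type": "string"}]}),
}
print(latest_schema_url("http://schema.registry.net", "topic_name"))
print(alter_table_statement("topic_name", response))
```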
  • 31. Hive ETL (What format?) ● Problem with Avro in Hive is that it is SLOW ● Let’s convert to something else! ○ Columnar (ORC, Parquet) ○ Easy OSS win!
  • 33. Hive ETL (Columnar Conversion) ● Hive makes format conversion VERY EASY ● CREATE TABLE new_table STORED AS ORC ● INSERT … SELECT * FROM original_table ● We built a Python lib for wrapping this ○ Hive-format-converter ○ Supports schema evolution
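A wrapper for this conversion can be as small as a function that emits the two Hive statements. The Python sketch below is not the Hive-format-converter library from the talk: the function name and the explicit column list are assumptions, and running the statements against Hive is left to the caller.

```python
def conversion_statements(source, target, columns, fmt="ORC"):
    """Build the Hive statements that copy `source` into a columnar `target`.

    `columns` is a list of (name, hive_type) pairs describing the source table.
    """
    col_defs = ", ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    col_names = ", ".join(f"`{name}`" for name, _ in columns)
    create = f"CREATE TABLE IF NOT EXISTS {target} ({col_defs}) STORED AS {fmt}"
    insert = f"INSERT OVERWRITE TABLE {target} SELECT {col_names} FROM {source}"
    return [create, insert]


statements = conversion_statements(
    "original_table", "new_table", [("name", "STRING"), ("age", "INT")]
)
for statement in statements:
    print(statement)
```

Listing the columns explicitly, rather than using `SELECT *`, keeps the column order in the target table under the wrapper's control when the source schema evolves.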
  • 35. Monitoring ● Nothing super sexy ● CodaHale/Dropwizard metrics is great with JVM ○ Know your metric types ■ Counters, meters, timers, gauges ● New Relic ● Icinga ● StatsD, CollectD ● Grafana, Graphite ● Health checks on APIs
  • 36. Lessons learned ● Need for data engineers to speak different languages ○ Networking, Infrastructure, and Ops ○ Frontend and Web ○ Backend ○ Data Scientists and Business Analysts ● Data is UBIQUITOUS
  • 37. Lessons learned ● [Diagram: Logs tier → Apps tier →]
  • 38. Lessons learned ● Data quality and usability should be a priority for all ○ Need to communicate and partner with teams ■ Product ■ Engineering ● Strike a balance between standards and flexibility for clients ○ Too little => sloppy, hard-to-manage data ○ Too much => slows down and annoys teams
  • 39. Lessons learned ● In a perfect world, teams have perfectly defined interfaces ● Perfect worlds do not exist ● Take the time to understand other teams’ code/systems ● Leads to better solutions influenced by diverse viewpoints
  • 40. Future Work and Possibilities ● Admin UI ○ Self-service topic creation ○ Finer-grained schema control ● More robust offset management in Consumer API ● Streaming as a Service
  • 42. Predicting future re-design ● Momentum of Kafka Ecosystem protects us partially ● Kafka Connect looks promising! ○ Framework for copying to/from Kafka ○ Looks to solve common pain points ● NiFi promising too! ○ Could potentially replace ingestion pieces ● More continuous spectrum of structure/flexibility tradeoff