Hadoop at Lookout
Aug 13, 2014
Yash Ranadive
@yashranadive
Thursday, August 14, 14
BIO
• Data Engineer
• From Mumbai, India
• Lived in 7 different cities in the US
• @yashranadive
• etl.svbtle.com
AGENDA
• What we do @Lookout
• Data warehouse
• Evolution from monolithic to micro-services
• Protocol Buffers
• Areas we are exploring
WHAT WE DO
@LOOKOUT
Over 50 million registered users
DATA TEAM
• 3 Data Engineers
• 6 data analysts
• Hadoop
• 64 hosts
• 300 TB capacity
DATA WAREHOUSE
[Diagram labels: INTERNAL AND EXTERNAL DATA SOURCES; Chunker, Mudskipper; WAREHOUSE: MySQL Star Schema Warehouse, HDFS (HIVE, HBase, Impala); R, Hue, Shiny, Tableau, Custom Apps]
FROM MONOLITHIC TO
MICROSERVICES
MONOLITHIC APPLICATION
[Diagram: Mobile/Web Clients -> HTTP -> RAILS APPLICATION (Routing, Controller, Views, ORM) -> Database (Tables)]
DATA INGESTION - MONOLITHIC
[Diagram: Application -> master_db -> MySQL Replication -> slave_db. slave_db and External Sources are loaded via ETL / ELT into the Data Warehouse (MySQL, Hive), which feeds Reporting.]
Ingestion is batch-oriented
PROBLEM
• Rails has a fast time to market (TTM) but is challenging to scale
• One code base
• Slower Deployments
• Too complex and large to manage
• Solution
• Microservices / service oriented architecture
• Break out the app into smaller services
MICROSERVICES ARCHITECTURE
[Diagram: the RAILS APPLICATION from before (Mobile/Web Clients over HTTP; Routing, Controller, Views, ORM; Database with Tables), now shown alongside separate services: Settings Service, Photo, Backup]
We frequently add new services
DATA INGESTION - MICROSERVICES
[Diagram: the monolithic path remains (Application -> master_db -> MySQL Replication -> slave_db; ETL / ELT into the Data Warehouse (MySQL, Hive); External Sources; Reporting). Added path: Settings Service, Backup Service, Locate Service -> Messaging Layer -> Consumer]
DATA INGESTION -
MONOLITHIC VS MICROSERVICES
Monolithic - Snapshot of a point in time
select * from user_settings;
id | setting_id  | user_id | modified_at
===========================================
1  | backup      | 2629    | 20140709T0400Z
3  | locate      | 2682    | 20140709T0402Z
8  | wipe        | 2629    | 20140709T0403Z
9  | theft_alert | 2629    | 20140709T0407Z
Microservices - Events
{"guid": 1, "event_type": "modify_setting", "setting_id": "backup", "setting_status": "ON", "user_id": "2629", "timestamp": "20140709T0400Z"}
{"guid": 3, "event_type": "start_backup", "user_id": "2629", "timestamp": "20140709T0400Z"}
...
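The difference between a snapshot and an event stream can be illustrated with a small sketch: replaying events in order rebuilds a snapshot-like view of current settings. This is only an illustration, not Lookout's code. The field names follow the slide's example events, while the third event (guid 4) and the function name `snapshot_from_events` are hypothetical.

```python
# Hypothetical events shaped like the examples on the slide.
# The guid 4 event is invented here to show a later change overriding an earlier one.
EVENTS = [
    {"guid": 1, "event_type": "modify_setting", "setting_id": "backup",
     "setting_status": "ON", "user_id": "2629", "timestamp": "20140709T0400Z"},
    {"guid": 3, "event_type": "start_backup",
     "user_id": "2629", "timestamp": "20140709T0400Z"},
    {"guid": 4, "event_type": "modify_setting", "setting_id": "backup",
     "setting_status": "OFF", "user_id": "2629", "timestamp": "20140709T0405Z"},
]


def snapshot_from_events(events):
    """Fold modify_setting events into {(user_id, setting_id): (status, timestamp)}."""
    state = {}
    # Timestamps share one fixed-width format, so string order is time order.
    for event in sorted(events, key=lambda e: (e["timestamp"], e["guid"])):
        if event["event_type"] != "modify_setting":
            continue  # other event types do not change settings
        key = (event["user_id"], event["setting_id"])
        state[key] = (event["setting_status"], event["timestamp"])
    return state


snapshot = snapshot_from_events(EVENTS)
```

The snapshot keeps only the latest value per setting, while the event list also records that a backup was started and that the setting was once ON, which is the extra history the event model provides.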
DESIGN
• We wanted to create an always-on event
ingestion framework that:
• Would scale workers on demand
• Would be easy to monitor
FIRST STAB - WORKER
Service -> ActiveMQ -> Ruby Worker -> HIVE
• Upstart script that daemonized the Ruby process
• Monitoring using Zenoss
• Very easy to set up
• Mapping Files for JSON -> CSV
• Ruby is terse and clean
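A mapping from JSON events to CSV rows might look roughly like the sketch below. The original worker was written in Ruby and its mapping file format is not shown in the slides, so the `MAPPING` structure, the column names, and the function `json_to_csv_row` are assumptions made purely for illustration (Python is used here as a stand-in).

```python
import csv
import io
import json

# Hypothetical mapping "file": ordered (CSV column, JSON key) pairs.
MAPPING = [
    ("event_id", "guid"),
    ("event_type", "event_type"),
    ("user_id", "user_id"),
    ("event_time", "timestamp"),
]


def json_to_csv_row(raw_message, mapping=MAPPING):
    """Turn one JSON message into one CSV line, following the mapping order."""
    event = json.loads(raw_message)
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # Missing keys become empty cells so every row has the same width.
    writer.writerow([event.get(json_key, "") for _, json_key in mapping])
    return buffer.getvalue()


row = json_to_csv_row(
    '{"guid": 3, "event_type": "start_backup", '
    '"user_id": "2629", "timestamp": "20140709T0400Z"}'
)
```

Keeping the mapping as data rather than code means a new event type only needs a new mapping entry, which is one plausible reason for using mapping files.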
PROBLEMS
• ActiveMQ
• ActiveMQ did not scale well - even with
multiple machines in the AMQ cluster
• ActiveMQ creates a separate queue for every
consumer of the topic
• Monitoring using Zenoss is not ideal, especially for
multi-process consumers
• The worker ran on a single machine, so it was not
fault tolerant
CURRENT ARCHITECTURE - WORKER
Service -> Kafka -> Storm -> HIVE
• Monitoring using Storm’s Thrift API
• Scaling number of workers is easy
• Kafka has better scalability than ActiveMQ
Service -> ActiveMQ
STORM TOPOLOGY
[Diagram: Service -> Kafka -> Storm -> HDFS]
Storm: Kafka Spout, ActiveMQ Spout -> Processing Bolt -> Storm-hdfs bolt
HDFS: Landing Directory -> Hive Directory
JSON PROBLEMS
• Problems with JSON
• No predefined schema
• No enforcement of backward compatibility
• Solution
• Protocol Buffers (also Avro/Thrift)
PROTOBUFS
• What?
• Way of encoding structured data
• Binary
• Why?
• Schema
• Backward compatibility
• Smaller in size than JSON
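To make the size claim concrete, the sketch below hand-encodes one of the earlier example events using the Protocol Buffers wire format (varint tags and length-prefixed strings) and compares it with the JSON encoding. In practice the `protoc`-generated classes would do this; the field numbers 1 to 4 and the helper names are assumptions chosen for illustration, not Lookout's schema.

```python
import json


def encode_varint(value):
    """Encode a non-negative int as a protobuf base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_int_field(field_number, value):
    # Wire type 0 = varint.
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def encode_string_field(field_number, text):
    # Wire type 2 = length-delimited.
    data = text.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(data)) + data


event = {"guid": 3, "event_type": "start_backup",
         "user_id": "2629", "timestamp": "20140709T0400Z"}

# Hypothetical field numbering: guid=1, event_type=2, user_id=3, timestamp=4.
binary = (encode_int_field(1, event["guid"])
          + encode_string_field(2, event["event_type"])
          + encode_string_field(3, event["user_id"])
          + encode_string_field(4, event["timestamp"]))
as_json = json.dumps(event).encode("utf-8")
```

The binary form is smaller mainly because field names are replaced by one-byte tags; the schema that gives those tags meaning lives in the shared .proto file instead of in every message.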
VERSIONING
• Backward-compatible changes only
[Diagram: Producer (Version 1.4 .proto) -> Queue -> Consumer (Version 1.1 .proto)]
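Why a consumer on an older schema can still read messages from a newer producer can be sketched with a hand-written decoder that skips field numbers it does not know. Real consumers would use generated protobuf classes, which handle unknown fields themselves; the field numbers, the extra field 5, and the names below are hypothetical.

```python
def read_varint(data, pos):
    """Read a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


# Field numbers the hypothetical "version 1.1" consumer knows about.
KNOWN_FIELDS_V1_1 = {1: "guid", 2: "event_type"}


def decode_known_fields(data, known=KNOWN_FIELDS_V1_1):
    """Decode varint and length-delimited fields, ignoring unknown field numbers."""
    message = {}
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(data, pos)
        elif wire_type == 2:  # length-delimited, assumed to be a UTF-8 string here
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length].decode("utf-8")
            pos += length
        else:
            raise ValueError("wire type %d not handled in this sketch" % wire_type)
        if field_number in known:
            message[known[field_number]] = value
        # Unknown fields were still consumed above, so parsing stays in sync.
    return message


# Message from a hypothetical "version 1.4" producer: fields 1, 2, and a new field 5.
from_v1_4_producer = b"\x08\x03\x12\x0cstart_backup\x2a\x04wifi"
decoded = decode_known_fields(from_v1_4_producer)
```

This only works if the producer never reuses or retypes an existing field number, which is what restricting changes to backward-compatible ones guarantees.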
SHARING PROTOBUF SCHEMAS
[Diagram: Producers push Java jars and Ruby gems to Artifactory (Schema Repo); the Data Team Storm Project pulls Java jars from it]
BUT HOW DO YOU STORE
PROTOBUFS IN HDFS?
HOW WE STORE PROTOBUFS
• Store raw version
• Raw dump of Kafka topic into HDFS
• Convert them to a tuple using Storm
• Inflate then convert to TSV
• Can query raw protobufs directly from HIVE but we
don’t yet
• elephant-bird (difficult to get it working)
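The "inflate then convert to TSV" step could look something like the sketch below once a message has been deserialized into a record. The talk does this inside Storm (on the JVM); Python is used here only as a stand-in, and the column list and the function `to_tsv_line` are assumptions. Writing missing values as `\N` matches Hive's default null marker for text tables.

```python
# Hypothetical column order of the target Hive table.
COLUMNS = ["guid", "event_type", "user_id", "timestamp"]


def to_tsv_line(record, columns=COLUMNS):
    """Render a deserialized record as one tab-separated line."""
    cells = []
    for name in columns:
        value = record.get(name)
        if value is None:
            cells.append("\\N")  # Hive's default NULL marker in text files
        else:
            # Tabs and newlines inside a value would break the row layout.
            cells.append(str(value).replace("\t", " ").replace("\n", " "))
    return "\t".join(cells) + "\n"


line = to_tsv_line(
    {"guid": 3, "event_type": "start_backup", "timestamp": "20140709T0400Z"}
)
```

A fixed column order matters because a text-format Hive table reads columns by position, not by name.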
STORM TOPOLOGY
[Diagram: Service -> Kafka -> Storm -> HDFS]
Storm: Kafka Spout, ActiveMQ Spout -> Deserialize Protobuf -> Storm-hdfs bolt
HDFS: Landing Directory -> Hive Directory
AREAS WE ARE
EXPLORING
SPARK
• ETL
• Wordcount ~5 lines of Scala code vs. 58 lines of
Java MapReduce code
• Spark Streaming can achieve results similar to
Storm's through micro-batching
http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
• Machine Learning
• Online learning using MLlib
• Logistic Regression and SVM
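The micro-batching idea mentioned above can be sketched without Spark: instead of handling each event as it arrives, events are grouped into small fixed time windows and each window is processed as a batch. This is plain Python illustrating the concept, not the Spark Streaming API; the event tuples and the function `micro_batches` are hypothetical.

```python
from collections import Counter, defaultdict


def micro_batches(events, batch_seconds):
    """Group (epoch_seconds, value) pairs into fixed, non-overlapping time windows."""
    batches = defaultdict(list)
    for ts, value in events:
        window_start = ts - (ts % batch_seconds)
        batches[window_start].append(value)
    return [(start, batches[start]) for start in sorted(batches)]


# Hypothetical stream of (timestamp in seconds, event_type).
events = [(0, "start_backup"), (1, "modify_setting"),
          (4, "start_backup"), (5, "start_backup")]

# Run an ordinary batch computation (a count per event type) on each window.
counts = [(start, Counter(values)) for start, values in micro_batches(events, 5)]
```

The trade-off is latency: results appear once per window rather than per event, in exchange for reusing batch-style code.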
H2O
• In-memory machine learning
• Tight integration with R
• Preferred by Data Scientists
OPEN SOURCE PROJECTS
• Currently open sourced
• Pipefish - write from MySQL to HDFS
github.com/lookout/pipefish
• Future
• Mudskipper - capture change-data
events from MySQL binlogs.
• Chunker - download MySQL table data
in chunks
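The slides do not describe how Chunker works internally. One common way to download a table in chunks is to page by primary key, as in the sketch below. It uses the standard-library `sqlite3` module as a stand-in for MySQL, and the table, its columns, and the function `iter_chunks` are all assumptions for illustration.

```python
import sqlite3


def iter_chunks(conn, table, key, chunk_size):
    """Yield lists of rows ordered by key; assumes key is the first column.

    Table and key names are interpolated directly, so they must be trusted
    identifiers (this is a sketch, not hardened code).
    """
    last_key = None
    while True:
        if last_key is None:
            sql = "SELECT * FROM %s ORDER BY %s LIMIT ?" % (table, key)
            rows = conn.execute(sql, (chunk_size,)).fetchall()
        else:
            sql = "SELECT * FROM %s WHERE %s > ? ORDER BY %s LIMIT ?" % (table, key, key)
            rows = conn.execute(sql, (last_key, chunk_size)).fetchall()
        if not rows:
            return
        yield rows
        last_key = rows[-1][0]  # resume after the last key seen


# In-memory stand-in table with the ids from the earlier user_settings example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_settings (id INTEGER PRIMARY KEY, setting_id TEXT, user_id INTEGER)"
)
conn.executemany(
    "INSERT INTO user_settings VALUES (?, ?, ?)",
    [(1, "backup", 2629), (3, "locate", 2682), (8, "wipe", 2629), (9, "theft_alert", 2629)],
)
chunks = list(iter_chunks(conn, "user_settings", "id", 3))
```

Paging by key rather than by OFFSET keeps each query cheap on large tables, since the database can seek straight to the next key.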
Questions

SF Hadoop Users Group August 2014 Meetup Slides