Gobblin @ NerdWallet (Nov 2015)


How NerdWallet uses Gobblin today, some pending contributions, and our future roadmap asks.

Published in: Software
  1. Gobblin @ NerdWallet, by Akshay Nanavati and Eric Ogren
  2. Agenda
     ● Introduction to NerdWallet
     ● Gobblin @ NerdWallet Today
     ● Initial Pain Points & Learnings
     ● Contributions (Present and Future)
     ● Future Use Cases & Requests
  3. What Is NerdWallet?
     ● Started in 2009. 275+ employees
     ● Highly profitable. Series A funding Feb 2015.
     ● We want to bring clarity to life’s financial decisions.
  4. NerdWallet Tech Stack (diagram labels: Front-End Services Tier, Data Analytics, Data Systems & Platforms)
  5. Data Types @ NerdWallet
     ● Partner Offer Data (MySQL & ElasticSearch: heavy reads, rare writes)
       ○ Synced to Redshift periodically
     ● Consumer Identity Data (Postgres: medium reads, medium writes)
     ● Site Generated Tracking Data (Redshift: heavy reads, heavy writes)
     ● Operational Data (e.g. Nginx logs) (Redshift: low reads, heavy writes) ?
     ● Internal Business Data (e.g. Salesforce) (Redshift: medium reads, rare writes)
     ● External 3rd Party Analytics Data (Redshift: medium reads, batch import)
  6. Gobblin @ NW Today
     ● Running in standalone mode
     ● Ingests user tracking and operational log data
     ● Tracking Data:
       ○ ~10 Kafka topics, 1 per event & schema type
       ○ Hourly Gobblin jobs pull from Kafka and dump to a date-partitioned directory in S3
       ○ Events are already serialized as protobuf in each Kafka topic
       ○ Around 100 events/second
     ● Log Ingestion (Operational Data):
       ○ Extracts data from AWS logs sitting in S3
       ○ Parses log lines and serializes them to protobuf
       ○ Writes the serialized protobuf files back to S3 and eventually into Redshift
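To make the "date-partitioned directory in S3" idea concrete, here is a minimal sketch of how an hourly job could derive its output prefix from an event timestamp. The function name `partition_prefix`, the `tracking` root, the `page_view` topic, and the `YYYY/MM/DD/HH` layout are all assumptions for illustration; the slides do not describe the actual directory layout.

```python
from datetime import datetime, timezone


def partition_prefix(topic: str, event_time: datetime, root: str = "tracking") -> str:
    """Build a date-partitioned key prefix for one topic (hypothetical layout)."""
    # Normalize to UTC so the same event always lands in the same partition,
    # regardless of the time zone attached to the incoming timestamp.
    utc = event_time.astimezone(timezone.utc)
    return f"{root}/{topic}/{utc:%Y/%m/%d/%H}"


# Example: an event from the 14:00 UTC hour on 5 Nov 2015.
example = partition_prefix(
    "page_view", datetime(2015, 11, 5, 14, 30, tzinfo=timezone.utc)
)
print(example)
```

Partitioning by time keeps each hourly run writing to its own prefix, which makes reruns and downstream loads (for example into Redshift) easier to scope to a single window.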
  7. Tracking Pipeline
  8. Learnings: Deploying Gobblin w/ Internal Code
     ● Have a repo of internal Gobblin modules (this is where we compile everything)
     ● Modified the build script to link the gobblin project to our gobblin-modules project
     ● Use Jenkins to compile Gobblin on the remote machine
     ● Maintain a separate repository with .pull files that we can sync with our stage and production environments
  9. Current Contributions
     ● Simple Data Writer
       ○ class gobblin.writer.SimpleDataWriter
       ○ Writes each binary record as bytes with no regard to encoding
       ○ Optionally prepends each record with its size or appends a char delimiter at the end of each record (e.g. \n for string data)
     ● Kafka Simple Extractor
       ○ class gobblin.source.extractor.extract.kafka.KafkaSimpleExtractor
       ○ class gobblin.source.extractor.extract.kafka.KafkaSimpleSource
       ○ Extracts binary data from Kafka as an array of bytes without any serde
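The two framing options mentioned for the Simple Data Writer (a size prefix or a trailing delimiter) can be illustrated with a short Python sketch. This is not the Gobblin implementation: `frame_records` and `read_length_prefixed` are hypothetical helpers, and the 4-byte big-endian length prefix is an assumed width chosen for the example.

```python
import struct
from typing import Iterable, List, Optional


def frame_records(
    records: Iterable[bytes],
    prepend_size: bool = False,
    delimiter: Optional[bytes] = b"\n",
) -> bytes:
    """Concatenate opaque byte records, optionally size-prefixed and/or delimited."""
    out = bytearray()
    for record in records:
        if prepend_size:
            # Assumed format: 4-byte big-endian unsigned length before each record.
            out += struct.pack(">I", len(record))
        out += record
        if delimiter is not None:
            out += delimiter
    return bytes(out)


def read_length_prefixed(blob: bytes) -> List[bytes]:
    """Split a blob written with prepend_size=True and delimiter=None."""
    records, offset = [], 0
    while offset < len(blob):
        (size,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        records.append(blob[offset:offset + size])
        offset += size
    return records


framed = frame_records([b"ab", b"c"], prepend_size=True, delimiter=None)
print(read_length_prefixed(framed))
```

For binary payloads such as serialized protobuf, a size prefix is the safer choice, because the delimiter byte can legitimately appear inside a record; a newline delimiter is convenient for plain string data.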
  10. Future Contributions
      ● Gobblin Dashboards
      ● S3 Source & Extractor
        ○ Given an S3 bucket, extract all files matching a regex
          ■ Leverages FileBasedExtractor
          ■ We would also like to modify this to have similar functionality to DatePartitionedDailyAvroSource
      ● S3 Publisher
        ○ Publishes files to S3
        ○ Currently there is an issue where the AWS S3 Java API doesn’t work correctly with HDFS; since we are running in standalone this is not an issue for us
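The "extract all files matching a regex" step of the proposed S3 source boils down to filtering a listing of object keys. The sketch below shows only that filtering step on an in-memory list; `select_keys`, the sample keys, and the pattern are hypothetical, and listing the bucket itself (which would go through an S3 client) is deliberately left out.

```python
import re
from typing import Iterable, List


def select_keys(keys: Iterable[str], pattern: str) -> List[str]:
    """Return the keys whose full name matches the regex, in listing order."""
    regex = re.compile(pattern)
    # fullmatch (rather than search) is an assumed choice: the pattern must
    # describe the whole key, which avoids accidental partial matches.
    return [key for key in keys if regex.fullmatch(key)]


# Hypothetical listing of an S3 bucket holding AWS logs.
listing = [
    "logs/2015/11/05/elb-0001.log",
    "logs/2015/11/05/elb-0002.log",
    "logs/2015/11/05/_SUCCESS",
    "tmp/scratch.log",
]
matched = select_keys(listing, r"logs/\d{4}/\d{2}/\d{2}/.*\.log")
print(matched)
```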
  11. Future: Dashboards
  12. Gobblin @ NW Tomorrow
      ● More data types
        ○ Offer data from partners: JSON/CSV/XML over {HTTP, FTP} => S3
        ○ Offer data from our site: MySQL => S3 (batch and incremental)
        ○ Identity data from our site: Postgres => S3 (batch and incremental, data hiding)
        ○ Salesforce Data
      ● Integration with Airflow DAGs
      ● Integration with data cleansing & entity matching frameworks
  13. Early Adoption Pain Points & Solutions
      ● Best practices for ingestion with transformation steps
      ● Initial problems integrating NW-specific code (especially extractors & converters) into Gobblin’s build process
      ● Best practices around scheduler integration: Quartz (built-in) vs ETL schedulers
      ● Backwards-incompatible changes forced us to do migrations when upgrading versions
      ● No changelogs & tagged releases
  14. Things we would like to see/add in future
      ● Abstract out Avro-specific code
      ● Best practices for scheduler integration (can contribute for Airflow)
      ● Clustering without requiring Hadoop & YARN
      ● Metadata support (job X produced files Y, Z)
      ● Release notes & tags :)
      ● The build & unit test process is very bloated
        ○ Hard to differentiate warnings/stack traces from legitimate build issues
        ○ Opens ports, creates temporary DBs, etc., which makes it difficult to test on arbitrary machines (port collisions)
  15. Thanks! Questions??