About VisualDNA Architecture @ Rubyslava 2014


The journey we took at VisualDNA in transforming our architecture from LAMP through BATCH to LAMBDA.

Published in: Technology, Business

  1. @ Rubyslava 2014
     Michal Hariš : michal.haris@visualdna.com - Technical Architect, joined VisualDNA in 2012
  4. Where were we 3 years ago
     ● 10 people working around one MySQL table holding 50M+ user profiles
     ● LAMP Architecture
     SCALABILITY ISSUES
     DECISION TO GO BIG (DATA)!
  5. Where were we 18 months ago
     ● 30-strong team, of which a single tech team of roughly 15 people
     ● Basically a batch architecture
       ● just not MySQL but CASSANDRA + HADOOP at the back
       ● http+php trackers with piped custom log
       ● batch process s3 upload every 5 min
       ● daily hdfs distcp
       ● POC = daily hadoop inference > 6 node cassandra -> batch integrations
       ● POC was a daily batch job which on bad days took 30 hours
       ● One of the first commercial Cassandra clusters in the world
         ● very unstable
  6. Where are we today
     ● Stack
       ● Java
       ● Scala
       ● Hadoop
       ● Cassandra
       ● Kafka
       ● Redis
       ● R
       ● AngularJS for the front-end
  7. Where are we today
     ● Auto-scaling geo-located Tracker Clusters - well, almost auto-scaling
     ● Robust Streaming Infrastructure - aggregation of all data streams in central infrastructure
       ● bringing in 8.5k events/second at peak
       ● These are primary events which get multiplied by various speed-layer [...]
     ● Real-time end-user products, scoring services, integrations with third parties where possible, pre-computation infrastructure that scales more predictively
     ● ETL Pipeline - offloading data streams and pre-computing materialised views onto HDFS > 30TB of primary data
       ● some data we keep only for the last 60 or 90 days, others we keep for ever
     ● Decision Analytics Pipeline (or RD Pipe) > 100TB+ of secondary data
       ● Using feature-extraction machine learning methods
  8. Where are we today
     ● Still one Cassandra ring, just bigger and more stable, 16 nodes, 250M+ active user profiles
     ● Lambda Architecture for real-time products like WHY Analytics
       ● RD Pipe is the "batch" layer (daily) that generates active profiles as a Cassandra "view layer"
       ● Primary Events are enriched by the Enrichment service with the user profiles produced daily ("speed layer")
       ● Combination of probabilistic counters and Redis cubes calculates the current audience profiles for subscribed websites ("speed layer")
       ● API on top of the Redis cubes serves the current audience profiles for the front-end suite of real-time analytics products ("serving layer")
       ● Audience Analytics product suite is the good looking bit - http://www.visualdna.com/why/
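The layering on slide 8 can be illustrated with a minimal Java sketch. It is not VisualDNA's code: the class `LambdaAudienceView`, its method names, and the in-memory maps standing in for Cassandra (batch view) and Redis (speed layer) are all assumptions made for illustration. It shows only the core idea, that a query merges the last daily batch result with the increments seen since that run.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of a Lambda-style read path; all names are assumptions.
final class LambdaAudienceView {
    // Batch view: counts per (site, segment) as of the last daily batch run
    // (stands in for the Cassandra "view layer").
    private final Map<String, Long> batchView = new HashMap<>();
    // Speed layer: increments observed since that batch run
    // (stands in for the Redis cubes).
    private final Map<String, Long> speedLayer = new HashMap<>();

    static String key(String site, String segment) {
        return site + "|" + segment;
    }

    // Daily batch job: publish fresh counts and drop the increments they now cover.
    // Simplification: events arriving while the batch job runs are ignored here.
    void onBatchRun(Map<String, Long> freshCounts) {
        batchView.clear();
        batchView.putAll(freshCounts);
        speedLayer.clear();
    }

    // Streaming path: one enriched event increments the real-time count.
    void onEvent(String site, String segment) {
        speedLayer.merge(key(site, segment), 1L, Long::sum);
    }

    // Serving layer: answer = batch view + speed-layer increments.
    long query(String site, String segment) {
        String k = key(site, segment);
        return batchView.getOrDefault(k, 0L) + speedLayer.getOrDefault(k, 0L);
    }
}
```

In this sketch a batch count of 100 followed by two streamed events yields 102 for the same site and segment, and the speed layer starts from zero again after the next batch run.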
  9. Where are we today
     ● 120-strong team, of which tech is roughly 60:
       ● Sysadmin Team
       ● Architecture Tech Team
       ● Decision Analytics Tech Team
       ● Consumer Tech Team
       ● WHY Analytics Team
  10. What have we learned
      ● Architecture:
        ● Updating JSON blobs in Cassandra columns is a trap
        ● Logging is better http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
        ● Metrics are crucial in large distributed systems
          ● yammer metrics + graphite + icinga works well for infrastructure
          ● but complex event/anomaly detection and pattern analysis gives the edge
        ● Real-time processing of data streams is not only cool, but scales well ... until you find a bottleneck in a single component which will limit the entire system
        ● Batch still matters
          ● but could be much faster than Hadoop, which suffers from too much redundant I/O and requires a coordinated ETL pipeline
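Why appending to a log is safer than rewriting a blob can be sketched as follows. This is an illustration in plain Java, not the Cassandra data model used at VisualDNA; `ProfileEvent`, `ProfileLog` and the attribute names are hypothetical. Writers only append immutable events, and the current profile is derived by replaying them, so no writer has to read, modify and write back a shared blob.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: an immutable event describing one profile change.
final class ProfileEvent {
    final String userId;
    final String attribute;
    final String value;

    ProfileEvent(String userId, String attribute, String value) {
        this.userId = userId;
        this.attribute = attribute;
        this.value = value;
    }
}

// Hypothetical sketch: an append-only log from which profile state is derived.
final class ProfileLog {
    // Events are only ever appended, never updated in place.
    private final List<ProfileEvent> events =
            Collections.synchronizedList(new ArrayList<>());

    void append(ProfileEvent event) {
        events.add(event);
    }

    // Derive the current profile of one user by replaying the log in order;
    // a later event for the same attribute wins.
    Map<String, String> currentProfile(String userId) {
        Map<String, String> profile = new TreeMap<>();
        synchronized (events) { // required when iterating a synchronizedList
            for (ProfileEvent e : events) {
                if (e.userId.equals(userId)) {
                    profile.put(e.attribute, e.value);
                }
            }
        }
        return profile;
    }
}
```

Replaying the whole log on every read is only acceptable in a toy example; a real system would keep a materialised view that is updated as events arrive.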
  11. What have we learned
      ● Engineering:
        ● the unix philosophy of building short, simple, clear, modular, and extendable code also applies to the design of distributed systems, not just an OS
        ● bad tests are better than no tests, but they are still bad, and most tests only test the positive outcome
          ● the story of Math.abs() -> it can actually return a negative number -> but none of the unit tests anticipated this -> which is why metrics and systems with feedback control are crucial
      ● Process:
        ● It is possible to co-operate remotely even on complex and not well-defined systems - atm some of the architecture team is working remotely on a permanent basis
        ● QA is intrinsic to Architecture and local to products
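The `Math.abs()` story from slide 11 can be reproduced in a few lines of Java. `Math.abs(Integer.MIN_VALUE)` returns `Integer.MIN_VALUE`, because +2147483648 does not fit in an `int`. The slide does not say where the bug occurred; the partitioning scenario below, and the names `naivePartition` and `safePartition`, are assumptions chosen because mapping a hash onto N buckets is a common place to hit it.

```java
// Hypothetical scenario: mapping a hash code onto one of N partitions.
final class AbsPitfall {
    // Looks safe, but Math.abs(Integer.MIN_VALUE) is still negative,
    // so the result can be a negative (invalid) partition index.
    static int naivePartition(int hash, int partitions) {
        return Math.abs(hash) % partitions;
    }

    // Math.floorMod (Java 8+) returns a value in [0, partitions) for partitions > 0.
    static int safePartition(int hash, int partitions) {
        return Math.floorMod(hash, partitions);
    }

    public static void main(String[] args) {
        int hash = Integer.MIN_VALUE; // the one int whose absolute value overflows
        System.out.println(Math.abs(hash));           // -2147483648
        System.out.println(naivePartition(hash, 10)); // -8
        System.out.println(safePartition(hash, 10));  // 2
    }
}
```

A unit test that only feeds ordinary hash values never reaches this case, which is the point the slide makes about testing only the positive outcome. Note that the two functions assign other negative hashes to different partitions, so replacing one with the other changes where existing keys land.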
  12. Interesting issues we’re facing
      1. SLAs vs. Start-up dynamics - Separate process (and to some degree architecture) for different levels of guarantee of service
      2. Globally-distributed highly-available API for random access to our profiles - enabling decisions based on VDNA profiles on-demand
      3. Our Lambda has a bottleneck at the enrichment point - although if we solve (2.) we will be half-way through
      4. Complex data pooling attribution model
      5. Cassandra still gives us some pain - it's the drivers! - interesting about consistency: http://aphyr.com/posts/294-call-me-maybe-cassandra/
      6. Preserving start-up dynamics and culture in a company of 200+ with offices in several cities
  13. We’re hiring for Bratislava office!
      ● We’re looking for engineers and analysts and more to be based in Bratislava
      careers-cee@visualdna.com