Big Data Day LA 2016 / Big Data Track - Puree through Trillions of Clicks in Seconds, Jags Srawan, Engineer, Interana


Interana is a full-stack analytics solution that provides lightning-fast querying using a proprietary storage format. It was designed to combine the best of in-memory and disk architectures. This talk introduces event data concepts and the advanced behavior analysis built into Interana. Attendees will learn how to model their data effectively using our full-service solution.

Published in: Technology

  1. Interana: Puree through trillions of clicks in seconds
  2. Agenda: Behaviour Queries Fast Data Ingest Deployment
  3. Who Am I
     Big Data Engineer at Interana
     SQL, Cassandra, Redis, Mongo, SOLR and now Interana
     If you want to create a big problem, build a database; if you want to create a huge problem, never delete anything
  4. Data > Opinion Journey
     ● Three engineers: Lior Abraham, Bobby Johnson and Ann Johnson
     ● Scuba at Facebook
     ● Take it to the masses
     ● Full Stack Solution - UI/API, Ingest and Storage Tier
  5. Customers
  6. Use Cases
     ● Dynamic Session
     ● Purchase/Ad Funnels
     ● Continuous deployment - A/B
     ● Bot Detection
     Want answers over TRILLIONS of data points in REALTIME
  7. What is Behaviour
  8. Concepts
     Event Stream
     Event - Actor - Behavior - At time T, Jack makes a purchase
     Session - Between login and logout, inactive time
     Cohort - Male, 25, California, In-N-Out Burger
     Funnel - Click on Ad => Viewed Item => Add to Cart => Made Purchase => Satisfaction
     Metric - Equation column derived from existing columns, storageless
  9. Time ordered Event Data
     Timestamp | User (SK) | Ad_id (SK) | Behaviour     | Is_Alcoholic (DM)
     July 1st  | Jack      | Beer       | Clicked on Ad | True
     July 1st  | Jill      | Juice      | Clicked on Ad | False
     July 2nd  | Jack      |            | Added To Cart |
     July 3rd  | Jack      |            | Purchase      |
     July 5th  | Jill      |            | Added To Cart |
     July 10th | Jill      |            | Logged Out    |
  10. Performance, Performance, Performance
      ● Columnar Store design for fast scanning
      ● C++/MMAP
      ● Pipelining
      ● Compression
  11. I/O - keep it near the core
  12. Sampling
      ● Lies, Damn Lies and Sampling
      ● Sampling takes advantage of the SK <-> Actor relationship
      ● Confidence depends on the shape of the distributions
      ● Sample rate is key. Sparse data gets tricky across hundreds of shards
  13. Sampling
  14. Ingest
  15. Ingest
      ● Schemaless, evolving organically
      ● Transformers and Pipelines
      ● Pull (S3, Blob, FS) or Push (HTTP)
      ● Dedupe and Replay friendly
  16. Columnar Format
  17. Operational Approach
      ● Managed Service - Hosted in AWS/Azure environment
      ● Coming Soon - Container-based solution (AMI), self-serve import as well
      ● Performance is critical - SATA vs SSD, Tiered Disks, RAM
      ● Redundancy - currently no live redundancy, use backup and playback
  18. The Cluster
  19. Typical Performance
                       | Sampled | Unsampled | Sampled/Tiered
      Count *          | 2s      | 11s       | 20s
      Group/Filter     | 2s      | 17s       | 10s
      Session duration | 5s      | 45s       | 10s
      Funnel           | 10s     | 60s       | 20s
      Ex. AWS - 32 * I2.xlarge - 800 GB SSD + 1.6 TB Tiered = 70 TB of storage
      2 Trillion Rows * 1000 columns
      Peak Throughput 500 MB/s
      Lines 10 M Rows/min
      Latency 5-20 minutes
      Query Import
  20. Explorer
  21. Music Data Set - 4B Rows
      ts: 1412210780000
      userId (sk): 130065
      sessionId (sk): 999FAFD51ASD
      artist: Audioslave
      auth: Logged In
      lastName: Brown
      level: free
      page: NextSong
      song: Gasoline
  22. Demo
      ● Music Data Set - 4B
      ● Dashboards
      ● Explorer
      ● Funnels
  23. Thanks. For more tech advice, come to our blog.
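
The funnel concept from slide 8 can be illustrated against the time-ordered event table from slide 9. What follows is a minimal Python sketch, not Interana's implementation: the function name `funnel_counts`, the tuple layout, the integer day encoding, and the choice of a three-step funnel are all assumptions made for illustration.

```python
from collections import defaultdict

# Events from slide 9 as (day_of_july, user, behaviour).
# The tuple layout is an illustrative assumption, not Interana's format.
EVENTS = [
    (1, "Jack", "Clicked on Ad"),
    (1, "Jill", "Clicked on Ad"),
    (2, "Jack", "Added To Cart"),
    (3, "Jack", "Purchase"),
    (5, "Jill", "Added To Cart"),
    (10, "Jill", "Logged Out"),
]

# Hypothetical three-step funnel built from behaviours present in the table.
FUNNEL = ["Clicked on Ad", "Added To Cart", "Purchase"]


def funnel_counts(events, steps):
    """Count how many actors reached each funnel step, in time order."""
    # Group behaviours per actor, ordered by timestamp.
    by_actor = defaultdict(list)
    for _, actor, behaviour in sorted(events):
        by_actor[actor].append(behaviour)

    counts = [0] * len(steps)
    for behaviours in by_actor.values():
        step = 0  # index of the next funnel step this actor must hit
        for behaviour in behaviours:
            if step < len(steps) and behaviour == steps[step]:
                counts[step] += 1
                step += 1
    return dict(zip(steps, counts))


result = funnel_counts(EVENTS, FUNNEL)
print(result)
```

With the slide 9 data, both Jack and Jill click an ad and add to cart, but only Jack purchases, so the funnel narrows at the last step.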
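
Slide 12 notes that sampling takes advantage of the relationship between the shard key (SK) and the actor. One common way to get that property is to sample whole actors by hashing the key, so that every event for a kept actor stays in the sample. The Python sketch below assumes this approach; the slides do not say how Interana samples, and the names `in_sample` and `estimate_actor_count` and the synthetic data are hypothetical.

```python
import hashlib


def in_sample(actor_id, rate):
    """Deterministically keep roughly `rate` of actors by hashing the shard key."""
    digest = hashlib.sha256(str(actor_id).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


def estimate_actor_count(events, rate):
    """Estimate distinct actors from the sample, scaling up by 1 / rate."""
    sampled = {actor for actor, _ in events if in_sample(actor, rate)}
    return len(sampled) / rate


# Synthetic data (assumption): 10,000 hypothetical users with 3 events each.
events = [(user, f"event-{n}") for user in range(10_000) for n in range(3)]
estimate = estimate_actor_count(events, rate=0.1)
print(estimate)
```

Because the decision depends only on the actor's key, an actor is either fully in or fully out of the sample, which keeps per-actor analyses such as sessions and funnels intact. The estimate gets noisier as data becomes sparse, which matches the slide's caution about sample rate.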