If the queue has a backlog, we have to scale up all the systems to keep it moving.
The data is in HDFS, S3, RDBMS, etc.
Analyzing a Hive query can be hard, but analyzing the performance metrics of many successful runs should be an easier problem to solve. Users are busy: they write a query and move on. The system should look at the performance and suggest optimizations.
One size does not fit all. Customer A's web data could have a very different scale than Customer B's.
Load balancer, queue length, CPU and memory, service uptime
Instrumenting your Instruments
Co-Founder @ 6sense
Hadoop Summit 2016
What does 6sense do?
How do we do it?
What does the pipeline look like?
Where do we do it?
What are the challenges?
How are we planning to solve them?
WHAT DOES 6SENSE DO?
• We find prospects that are in market to buy
• We empower marketing and sales teams
Account Name       Buying Stage     Profile Fit
ACME Corporation   Purchase         Strong
ABC Corp           Decision         Strong
XYZ Systems        Consideration    Medium
Doe Inc            Awareness        Strong
HOW DO WE DO IT?
Data for the
WHAT DOES THE PIPELINE LOOK LIKE?
Ingest → Process → Export
WHAT AFFECTS PERFORMANCE?
─ Non-Partitioned tables
─ File format
─ Data Locality
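As a rough illustration of how the first two factors could be checked automatically, here is a minimal sketch. The dictionary keys (`file_format`, `partition_columns`) and the hint wording are assumptions made for this example, not part of any Hive API or of the 6sense system.

```python
# Hypothetical rule-based check; the table description is a plain dict
# whose keys are invented for this sketch.
TEXT_FORMATS = {"text", "textfile", "csv", "json"}

def suggest(table):
    """Return a list of plain-text hints for one table description."""
    hints = []
    if not table.get("partition_columns"):
        hints.append("table is not partitioned; queries may scan all of it")
    if table.get("file_format", "").lower() in TEXT_FORMATS:
        hints.append("row-oriented text format; a columnar format "
                     "such as ORC or Parquet may read less data")
    return hints

raw = {"name": "web_events", "file_format": "TEXT", "partition_columns": []}
tuned = {"name": "web_events_orc", "file_format": "ORC",
         "partition_columns": ["dt"]}

for t in (raw, tuned):
    print(t["name"], suggest(t))
```

Data locality is left out because it cannot be judged from table metadata alone.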
METRICS THAT MATTER
• # of Mappers
• # of Input Files
• # of Input Records
• # of Records passed on to the next stage
• Time taken in
• # of Reducers
• # of compressed vs uncompressed files
• File formats
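One possible way to hold these counters for a single YARN job is sketched below. All field and method names are hypothetical, and the derived ratios are only examples of what the raw counts make possible.

```python
from dataclasses import dataclass

@dataclass
class StageMetrics:
    """Counters for one YARN job. Field names are hypothetical."""
    mappers: int
    reducers: int
    input_files: int
    input_records: int
    output_records: int      # records passed on to the next stage
    compressed_files: int
    uncompressed_files: int
    file_format: str

    def pass_through_ratio(self) -> float:
        """Fraction of input records that reach the next stage."""
        if self.input_records == 0:
            return 0.0
        return self.output_records / self.input_records

    def files_per_mapper(self) -> float:
        """Average number of input files handled by each mapper."""
        if self.mappers == 0:
            return 0.0
        return self.input_files / self.mappers

    def compressed_fraction(self) -> float:
        """Share of input files that are compressed."""
        total = self.compressed_files + self.uncompressed_files
        return self.compressed_files / total if total else 0.0

# Made-up sample values, for illustration only.
sample = StageMetrics(mappers=40, reducers=4, input_files=400,
                      input_records=1_000_000, output_records=50_000,
                      compressed_files=300, uncompressed_files=100,
                      file_format="ORC")
print(sample.pass_through_ratio(), sample.files_per_mapper())
```

A low pass-through ratio suggests a job that reads far more than it keeps, which is often a hint that filtering could happen earlier.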
WHAT DO WE STORE?
• Job Name 1
─ Date 1
o Yarn Job # 1
o Yarn Job # 2
─ Date 2
o Repeat as above
• Job Name 2
─ Repeat as above
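The hierarchy above (job name, then date, then YARN jobs) maps naturally onto a nested dictionary. The sketch below is an in-memory stand-in; the actual storage backend is not described here, and the job names, dates, and YARN job IDs are made up.

```python
from collections import defaultdict

# job name -> run date (ISO string) -> list of per-YARN-job metric dicts
store = defaultdict(lambda: defaultdict(list))

def record(job_name, run_date, yarn_job_id, metrics):
    """Append the metrics of one YARN job under its job name and date."""
    store[job_name][run_date].append({"yarn_job_id": yarn_job_id, **metrics})

def runs_on(job_name, run_date):
    """Return the YARN jobs recorded for a job on a date (empty if none)."""
    return store.get(job_name, {}).get(run_date, [])

# Hypothetical sample data.
record("daily_ingest", "2016-06-01", "job_0001", {"mappers": 40, "input_files": 400})
record("daily_ingest", "2016-06-01", "job_0002", {"mappers": 8, "input_files": 16})
record("daily_ingest", "2016-06-02", "job_0003", {"mappers": 42, "input_files": 410})

print(len(runs_on("daily_ingest", "2016-06-01")))
```

Keying by job name first keeps all runs of the same logical job together, which makes day-over-day comparison straightforward.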
WHAT DO WE USE THEM FOR?
• Finding the Job that
─ Is the slowest
─ Processes the most files
─ Filters out most of the data
─ Uses the most memory
• Observe trends over time in the above metrics
• Get alerted on changes in the trends, both up and down
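A minimal sketch of the first and last uses: pick the slowest job from the latest run, and flag any job whose latest value moved away from its recent average in either direction. The job names, durations, and the 25% threshold are assumptions chosen for illustration.

```python
from statistics import mean

# Hypothetical history: job name -> daily durations in seconds, oldest first.
history = {
    "ingest_web": [600, 610, 590, 605, 900],
    "score_accounts": [1200, 1190, 1210, 1205, 1195],
    "export_crm": [300, 310, 305, 295, 150],
}

def slowest_job(history):
    """Name of the job with the longest most-recent run."""
    return max(history, key=lambda name: history[name][-1])

def trend_alerts(history, threshold=0.25):
    """Relative change of the latest run versus the mean of earlier runs.

    Returns only jobs whose change is at least `threshold` in either
    direction, so both slowdowns and sudden speedups are reported.
    """
    alerts = {}
    for name, values in history.items():
        if len(values) < 2:
            continue  # no baseline to compare against
        baseline = mean(values[:-1])
        if baseline == 0:
            continue
        change = (values[-1] - baseline) / baseline
        if abs(change) >= threshold:
            alerts[name] = round(change, 2)
    return alerts

print(slowest_job(history))
print(trend_alerts(history))
```

A sudden drop is flagged as well as a rise, since a job that finishes much faster than usual may have processed less data than it should.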
• Storage Format
• Compression Type
• Partition Columns
• Which job is causing the bottleneck?
• How many errors can we tolerate?
• Which job is the biggest offender?
• Which job fails the most?
• What did the latest release do?
• Can we scale the number of customers?
• What does it cost to add a customer?
• What does it cost to add a job to each customer’s pipeline?
VENDOR SHOUT OUT
• ClusterK (now AWS Spot Fleet)
─ Allows us to use different instance types to load balance and reduce costs
• Sumo Logic
─ Detect variances in behavior over a custom time period
─ Collects, monitors and alerts on the following metrics
o AWS CloudWatch metrics (Queue length, S3 bucket size, etc.)
o Host metrics (CPU, Memory, Disk Space, etc.)
o Service metrics (YARN, HBase, Mesos, etc.)
o Container metrics - Docker
o Custom metrics – Anything else you want to send
• premal at 6sense.com