Delivering Performance Under Schedule and Resource Pressure: Lessons Learned at Google and Microsoft

Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1oEGtyD.

Ivan Filho shares lessons learned during the development and release of several large scale services at Microsoft and Google from the perspective of a performance manager. Filmed at qconsf.com.

Ivan Santa Maria Filho is currently the performance technical lead for Google Cloud, and his prior experience includes several large releases, including Bing.com and SQL Azure.

Published in: Technology

Transcript

  • 1. Delivering Performance Under Resource and Schedule Pressure. Ivan Santa Maria Filho, Google Cloud Performance TL
  • 2. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/performance-manager-google-microsoft
       InfoQ.com: News & Community Site
       ● 750,000 unique visitors/month
       ● Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese)
       ● Post content from our QCon conferences
       ● News 15-20 / week
       ● Articles 3-4 / week
       ● Presentations (videos) 12-15 / week
       ● Interviews 2-3 / week
       ● Books 1 / month
  • 3. Presented at QCon San Francisco, www.qconsf.com
       Purpose of QCon
       - to empower software development by facilitating the spread of knowledge and innovation
       Strategy
       - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams
       - speakers and topics driving the evolution and innovation
       - connecting and catalyzing the influencers and innovators
       Highlights
       - attended by more than 12,000 delegates since 2007
       - held in 9 cities worldwide
  • 4. Disclaimer While I use what I know to do my job, the contents of this presentation reflect only my opinion, not those of my present or past employers.
  • 5. The formal performance cycle
       1. Formulate a hypothesis
       2. Develop a prototype
       3. Validate findings
       4. Integrate improvements
       5. Repeat
       This is the cycle you want
  • 6. The real performance cycle
       ● If the numbers look good: You’re a genius! Done.
       ● If the numbers look bad: The scenarios are wrong
       ● If scenarios are right: The methodology is flawed
       ● If the methodology is proven: Buggy implementation
       ● If the implementation is proven: It is too late to change
       ● If the team leadership decides to hold the release: It’s been so long the scenarios are now outdated
       This is the cycle you start with
  • 7. The nature of performance work There is no standing still. You are either moving forward or falling.
  • 8. The nature of performance work Developers never stop writing code. Make performance improvement the default activity.
  • 9. The nature of performance work Don’t dig a performance hole. If you’re in one then start to get out immediately.
  • 10. Delivering Under Pressure
       Make it a team problem
       Make it the team’s second nature
       “A team is a group of people with a common goal, where every single member is necessary to accomplish that goal, everyone knows their role, and everyone knows each other’s role”
  • 11. Leading to a positive cycle
       Organizational and personal styles
       ● Telling: Directly tell people what to do
       ● Selling: Influence the team or key stakeholders
       ● Participating: Share the decision-making
       ● Delegating: Trust other leaders, but monitor progress
       Scope of influence
       ● Senior leadership, managers, and engineers lead differently
       ● The need to explain the why, what, when, where, and how remains
       Organization maturity
       ● High: Capable and confident
       ● Moderate: Capable but unwilling; Unable but willing
       ● Low: Unable and insecure
  • 12. Why
       Create product capability and competitive matrices
       ● Enterprise products succeed because they either save money or enable new things
       ● Consumer products also succeed because they are “cool”
         ○ Consumers might, over a long time, change the enterprise
       ● Enumerate all product capabilities
       ● Keywords
         ○ Direct: “throughput” and “latency”
         ○ Indirect: “fluid”, “natural”, and “amazing”
  • 13. What
       Create product backlog and metrics
       ● Enumerate all known product features
       ● Define how to measure success for each one - you need to know when to stop
       Great performance metrics are
       ● Few and memorable
       ● Intentional, purposeful, and consequential at all levels
       ● Measurable and actionable
       ● Aimed at the competition, or at raising a barrier to entry for competitors
  • 14. Types of performance metrics
       (metric) | Strategic | Tactical | Foundational
       Type | Competitive | Customer scenarios | Micro-benchmarks
       Owner | Business owner | Dev manager, director | Engineers
       Example | TPC or YCSB | TestDFSIO | FIO 4KB random reads
       Notable performance metrics
       ● Awesomeness: Boeing 737 passenger throughput
       ● Unintended consequences: FAA departure on time
       ● “It’s complicated”: MS-SQL Server Replication time to resolution
  • 15. Metric collection methodology
       Collection methodology matters
       1. Prepare cluster
       2. Deploy product and background data
       3. Warm-up period
       4. Run benchmark (typically all you see)
       5. Cool-down period
       6. Turn down cluster
       Understand what you want to know
       ● Ignoring bootstrap/turndown on cloud deployments
       ● Ignoring perceived versus actual performance on user interfaces
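The six collection steps on slide 15 can be sketched as a small harness that times every phase, so that bootstrap and turndown costs stay visible next to the benchmark number. This is only an illustration: the phase names mirror the slide, while `run_phases`, `benchmark_share`, and the `placeholder` bodies are hypothetical and not part of the talk.

```python
import time
from typing import Callable, Dict, List, Tuple

Phase = Tuple[str, Callable[[], None]]


def run_phases(phases: List[Phase]) -> Dict[str, float]:
    """Run each phase in order and record its wall-clock duration in seconds."""
    timings: Dict[str, float] = {}
    for name, action in phases:
        start = time.perf_counter()
        action()
        timings[name] = time.perf_counter() - start
    return timings


def placeholder() -> None:
    """Stand-in for real work (provisioning, loading data, driving load)."""
    time.sleep(0.001)


# Phase names follow the slide; the bodies are hypothetical placeholders.
PHASES: List[Phase] = [
    ("prepare_cluster", placeholder),
    ("deploy_product_and_data", placeholder),
    ("warm_up", placeholder),
    ("run_benchmark", placeholder),
    ("cool_down", placeholder),
    ("turn_down_cluster", placeholder),
]


def benchmark_share(timings: Dict[str, float]) -> float:
    """Fraction of total elapsed time spent in the benchmark phase itself."""
    total = sum(timings.values())
    return timings["run_benchmark"] / total if total > 0 else 0.0


if __name__ == "__main__":
    results = run_phases(PHASES)
    for name, seconds in results.items():
        print(f"{name:26s} {seconds * 1000:8.2f} ms")
    print(f"benchmark share of total: {benchmark_share(results):.0%}")
```

Reporting the benchmark phase as a share of the total makes it harder to overlook the phases a typical report leaves out.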
  • 16. Observable performance vs. SLAs
       Service Level Agreements (SLAs)
       ● Contractual obligations between provider and consumer
       ● Clear unit of measurement and collection methodology
       ● Reporting cards
       ● Validation and Remediation mechanisms
       ● Escalation path on violations
       Observable performance
       ● How the product behaved lately - no promises, no guarantees
       ● What most customers take dependencies on
  • 17. When
       Define meaningful deadlines
       ● Tied to “why” the product exists and “what” defines success
       ● The team knows what they want to learn each time
       Deadlines should improve the team maturity
       ● Create the habit of measuring progress against metrics
       ● Checkpoint to refine metrics
       ● Opportunity to refine estimation and learn the team reaction time
       ● Post-mortems should bring data insights, not data dumps
       Deadlines should give you something to celebrate
  • 18. How
       Continuously identify cost structure improvements
       ● Model the cost structure: CapEx and OpEx
         ○ How does cost grow as the user base grows?
         ○ How does cost grow as the background data grows?
         ○ How does cost grow as engagement grows?
       ● Enumerate risk factors and dependencies
         ○ How does cost grow as your cloud provider price sheet changes?
       Having an opinion does not equate to knowing
       ● Create price/performance models using product metrics
       ● Highly recommended blog: http://perspectives.mvdirona.com/
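One way to turn the questions on slide 18 into numbers is a small cost model driven by product metrics. The sketch below is a deliberately simple linear model, and everything in it is an assumption made for illustration: the `PriceSheet` and `UsageProfile` types, the unit prices, and the per-user usage figures are hypothetical, not figures from the talk or from any provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriceSheet:
    """Hypothetical unit prices in dollars; not taken from any real provider."""
    storage_per_gb_month: float
    egress_per_gb: float
    per_million_requests: float


@dataclass(frozen=True)
class UsageProfile:
    """Hypothetical per-user monthly usage, as product metrics might report it."""
    gb_stored: float
    gb_served: float
    requests: float


def monthly_cost(users: int, usage: UsageProfile, prices: PriceSheet) -> float:
    """Linear operating-cost model: storage + egress + request charges."""
    storage = users * usage.gb_stored * prices.storage_per_gb_month
    egress = users * usage.gb_served * prices.egress_per_gb
    requests = users * usage.requests / 1_000_000 * prices.per_million_requests
    return storage + egress + requests


if __name__ == "__main__":
    usage = UsageProfile(gb_stored=10, gb_served=5, requests=2_000)
    current = PriceSheet(0.04, 0.12, 1.00)
    proposed = PriceSheet(0.04, 0.15, 1.00)  # a supplier raises egress prices
    for users in (1_000, 10_000, 100_000):
        now = monthly_cost(users, usage, current)
        later = monthly_cost(users, usage, proposed)
        print(f"{users:>7} users: ${now:,.2f}/month now, ${later:,.2f}/month after the change")
```

A real model would also need step functions for capital purchases and volume discounts; the point here is only that each question on the slide maps to one parameter you can vary.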
  • 19. Where
       Determine market specific metrics
       ● Supplier latencies: Compare to SLAs and product needs
       ● Local user devices: how fast are the devices?
       ● Supplier monopolies and regulatory environment
       Validate new markets against performance metrics
       “On the Cloud there are three key elements to performance: Location, location, and location” - Anthony F. Voellm (my boss)
  • 20. Selected network latencies Average query latencies to Google’s BigQuery (in ms) M-LAB raw data: https://code.google.com/p/m-lab/wiki/PDEChartsNDT
  • 21. Recognizing good performance work
       Telltale signs
       ● Aligned in spirit with “why” the product exists
       ● Tied to “what” the product delivers through a latency or throughput metric
       ● Scope of work contained by “when” it should be available
       ● Provides structural advantages in “how” the product is built (TCO is a plus)
       ● Has leftover assets in case of failure
       It is harder to spot than you might think
       ● Data insight is non-obvious by definition
       ● You might not want to share some decisions
       ● It might be about someone failing and learning
  • 22. Non-obvious example - Operations
       “I’d like to automate our deployment process”
       ● Performance metrics might not include bootstrap times
       ● Potentially disruptive to operations
       ● Might change how your company negotiates with suppliers
       But...
       ● If it takes 1 DBA for every 65 servers (or 65k apps), how many to run a hosted Database service?
       ● Creative employees do not like manual work - retention
       ● Not everyone likes to carry pagers - hiring
       ● If “it takes a village” to run your business you might have to invade one:
         ○ Microsoft’s 99k+ employees to Google’s 30k+
         ○ GroupOn’s 10k+ employees to Netflix’s 2k+
  • 23. Even less obvious consequences
       Google operations as structural advantage
       ● A Warehouse Computer is not a building full of computers
       ● End to end design: Cooling, power, layout, network, …
       ● Cost control from start: Capex, repairs, deployments, ...
       ● Start simple and iterate
       A different way to do it
       ● Tireless study of market and competitors
       ● Concentrate on software only
       ● Leverage the market and partners
       ● Start with a commodity solution and work with OEMs
  • 24. Non-obvious example - Power costs
       “I’d like to use energy prices to guide storage locations”
       ● Might rent space from a cloud provider
       ● Who cares? It likely hurts tail latency
       ● Migrating data is a pain
       But...
       ● Being able to migrate helps deal with supplier pricing
       ● Scaling out, not up, is a more likely model for Cloud
       ● It might be a learning opportunity
  • 25. Power costs look promising...
       Important planning metric for Cloud providers
       ● Determines which servers are available
         ○ Expensive energy: Newer servers
         ○ Cheap energy: Older servers if real estate prices and SLAs allow
       Google and power management
       ● Plans using thermal modeling
       ● Raises the thermostat (to 80°F)
       ● Manages airflow on the cheap: Big slow fans, cooling towers, drippers, isolate components that run hotter, …
       ● Actively manages available power and power footprint
       ● Became an energy trader
  • 26. … But then the network gets you
       The challenge with storage today is access, not volume.
       [Chart: HDD Storage Historical Prices, $/GB; labeled values $437K/GB and $0.04/GB]
       Storage pricing (GB)
       ● Commodity HDD: $0.04
       ● AWS: $0.037 to $0.095
       ● Google: $0.042 and $0.085
       Access Pricing (egress, GB)
       ● AWS: $0 to $0.12, to “call us”
       ● Google: $0 to $0.15, to “call us”
         ○ Charges per operation
       ● Ingress almost always free
  • 27. Notes on large distributed storage
       Lessons from Google
       ● Rare performance problems affect a significant fraction of all requests
       ● Eliminating all sources of latency variability is impractical
       ● Tail-tolerant techniques make a predictable whole from less predictable parts
       Complicated systems interact in complicated ways
       ● Global resources (switches and shared file systems)
       ● Shared resources (locks, cores, memory and net bandwidth)
       ● Daemons and background tasks
       ● Maintenance (data reconstruction, log compactions, SSD garbage collection)
       ● Power limits and management enforced by CPU and rack
       ● Network latencies
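The first lesson on slide 27, that rare problems affect a significant fraction of requests, follows from simple probability once a request fans out to many servers. The sketch below works through the illustration used in the "The Tail at Scale" article linked on slide 29: if each server is slow 1% of the time and a request must wait for 100 servers, most requests are slow. The function name `fraction_slow` is hypothetical, and the calculation assumes servers are slow independently of one another.

```python
def fraction_slow(per_server_slow: float, fanout: int) -> float:
    """Probability that at least one of `fanout` independent servers is slow."""
    if not 0.0 <= per_server_slow <= 1.0:
        raise ValueError("per_server_slow must be a probability")
    if fanout < 0:
        raise ValueError("fanout must be non-negative")
    # The request is fast only if every server it waits on is fast.
    return 1.0 - (1.0 - per_server_slow) ** fanout


if __name__ == "__main__":
    for p, n in [(0.01, 1), (0.01, 100), (0.0001, 2000)]:
        print(f"slow rate {p:.2%} per server, fan-out {n:>4}: "
              f"{fraction_slow(p, n):.1%} of requests are slow")
```

With a 1% per-server slow rate and a fan-out of 100, roughly 63% of requests wait on at least one slow server, which is why tail-tolerant techniques matter more than chasing each individual source of variability.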
  • 28. Non obvious - “unexpected” assets
       While chasing power costs
       ● Learned how to move data around
       ● Understood tail latencies and throughput (observable and SLAs)
       ● Learned how to route requests around data moves
       ● Developed good storage performance
       So now you can
       ● Improve the product by adding high tail latency tolerance features
       ● Move your data around when necessary
       ● Scale-out better
       ● Manage your network traffic around peak demand
  • 29. BigTable read latencies Published at Source article: http://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scale/fulltext
  • 30. A note on Order of Growth analysis
       1  get a positive integer n from input      Runs once: T1
       2  if n > 10                                Runs once: T2
       3    print "This might take a while..."     Runs maybe once: T3
       4  for i = 1 to n                           Runs n+1 times => T4∙(n+1)
       5    for j = 1 to n                         Runs n+1 times per outer iteration => T5∙(n)∙(n+1)
       6      print i * j                          Runs (n)(n) times => T6∙(n)∙(n)
       7  print "Done!"                            Runs once: T7
       T1+T2+T3+T4∙(n+1)+T5∙(n)∙(n+1)+T6∙(n)∙(n)+T7
       ⇒ T6∙n²+T5∙n²+T5∙n+T4∙n+T1+T2+T3+T4+T7
       ⇒ The algorithm is O(n²), but what about T?
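The per-line counts on slide 30 can be checked by instrumenting the pseudocode. The sketch below is one possible translation, not the author's code: it replaces the `for` loops with explicit loop tests so that the extra, failing test (the "+1") is counted, and it omits the `print` statements. The function name `count_line_executions` is hypothetical.

```python
from typing import Dict


def count_line_executions(n: int) -> Dict[int, int]:
    """Count how often each numbered line of the slide's pseudocode runs."""
    counts = {line: 0 for line in range(1, 8)}
    counts[1] += 1                 # line 1: read n
    counts[2] += 1                 # line 2: if n > 10
    if n > 10:
        counts[3] += 1             # line 3: warning message
    i = 1
    while True:
        counts[4] += 1             # line 4: outer loop test (n passes + 1 failure)
        if i > n:
            break
        j = 1
        while True:
            counts[5] += 1         # line 5: inner loop test, n+1 per outer pass
            if j > n:
                break
            counts[6] += 1         # line 6: loop body, n*n in total
            j += 1
        i += 1
    counts[7] += 1                 # line 7: final message
    return counts


if __name__ == "__main__":
    for n in (3, 11):
        print(n, count_line_executions(n))
```

The counts match the slide's terms: n+1 for line 4, n∙(n+1) for line 5, and n² for line 6. What the counts cannot show is the size of each T, which is the question the slide ends on.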
  • 31. Values of T you learn over time
       ● 1 GHz CPU clock (1ns)
       ● 1MB TCP tx, 10Gbps card nominal speed (1.1 ms)
       ● Photon travels 5,000Km in fiber (5ms)
       ● 1MB TCP tx, 1Gbps card nominal speed (5ms)
       ● Human perceived delays (~100ms)
       ● Human eye blink (~350ms)
       ● MySQL 5.5 default timeout (5s)
       ● Alerting troops 200 miles out, Tang dynasty (1h)
       Duration | What
       4ns | L2 cache reference
       17ns | Mutex lock/unlock
       82us | RAM copy 1MB (core i7-2600)
       2ms | SSD 1MB sequential read
       4ms | Disk seek
       5ms | HDD 1MB sequential read
       5ms | Copy 1 MB over network - same DC
       16ms | Copy 1 MB over network - same region
       200ms | Download 1 MB (fast US ISP)
       2.1s | Download 1 MB (slow US ISP)
       1h | Download 1 MB (slow US ISP and slow device)
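The durations on slide 31 are easier to reason about as ratios than as absolute numbers. The sketch below expresses each entry relative to a chosen baseline. It assumes the durations pair with the operations in the order the slide lists them, and the names `LATENCY_NS` and `relative_to` are hypothetical.

```python
from typing import Dict

# Durations from the slide's "Duration / What" table, in nanoseconds.
# Assumption: values and labels pair up in the order they appear on the slide.
LATENCY_NS: Dict[str, int] = {
    "L2 cache reference": 4,
    "Mutex lock/unlock": 17,
    "RAM copy 1 MB (Core i7-2600)": 82_000,
    "SSD 1 MB sequential read": 2_000_000,
    "Disk seek": 4_000_000,
    "HDD 1 MB sequential read": 5_000_000,
    "Copy 1 MB over network - same DC": 5_000_000,
    "Copy 1 MB over network - same region": 16_000_000,
    "Download 1 MB (fast US ISP)": 200_000_000,
    "Download 1 MB (slow US ISP)": 2_100_000_000,
}


def relative_to(baseline: str, table: Dict[str, int] = LATENCY_NS) -> Dict[str, float]:
    """Express every duration as a multiple of the baseline operation."""
    base = table[baseline]
    return {what: ns / base for what, ns in table.items()}


if __name__ == "__main__":
    for what, ratio in relative_to("L2 cache reference").items():
        print(f"{what:40s} {ratio:>16,.0f}x")
```

Read this way, one disk seek costs about as much as a million L2 cache references, which is the kind of constant that an order-of-growth analysis hides.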
  • 32. Total team effort
       To deliver good performance under pressure you need the whole team involved, and it must be their second nature to do so
       ● Learn or define what the team values
       ● Reward successful milestones in their currency
       ● Make performance work easier
         ○ Control architecture complexity
         ○ Make data readily available for analysis
         ○ Make it easy to add more data
         ○ Reward developing and sharing tools
  • 33. Know or define the team currency
       Companies have currencies
       ● What justifies a promotion?
       ● What makes one shine during performance reviews?
       People have currencies
       ● Cash, equity, and stock
       ● Technical challenges
       ● Reputation
       ● Sense of purpose
       ● Power
  • 34. Common “rewards”
       Currency | Stick | Carrot
       Money | Cut pay, fire | Money and equity, spot bonuses. Logarithmic returns after a “sanitizing level”
       Technical challenge | Assign menial tasks | Ask help with structural changes
       Reputation | Public shaming | Public acknowledgement, honest flattery. Technical Leader designation. As simple as a certificate or small gift card
       Sense of purpose | Show apathy towards their work | Show how their work helps the team mission and product value proposition
  • 35. Simple improvements
       Instrument your product
       ● Logs over profilers
       ● Decide on logging formats and routines early
       ● Implement distributed correlation IDs
       Make product milestones work
       ● Instrument test and deployment tools to collect performance data
       ● “It only works if it works in production” mentality
         ○ “Any sufficiently advanced technology is indistinguishable from magic” - Arthur C. Clarke
       ● Design A/B testing and canary deployments
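The advice on slide 35 to settle on a logging format early and to implement correlation IDs can be sketched with Python's standard `logging` and `contextvars` modules. This is a minimal single-process illustration under stated assumptions: the log format, the `handle_request` function, and the idea that the ID arrives with the incoming request are all hypothetical, and a real distributed system would also forward the ID on every outgoing call.

```python
import contextvars
import logging
import uuid
from typing import Optional

# Holds the ID of the request currently being handled; "-" means "no request".
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Attach the current correlation ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id.get()
        return True


def build_logger(name: str = "perf-demo") -> logging.Logger:
    """Create a logger whose output format includes the correlation ID."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(levelname)s cid=%(correlation_id)s %(message)s"))
        handler.addFilter(CorrelationIdFilter())
        logger.addHandler(handler)
    return logger


def handle_request(logger: logging.Logger, incoming_id: Optional[str] = None) -> str:
    """Reuse the caller's ID if one was passed along, otherwise mint a new one."""
    cid = incoming_id or uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        logger.info("request received")
        logger.info("request completed")
    finally:
        correlation_id.reset(token)
    return cid


if __name__ == "__main__":
    log = build_logger()
    handle_request(log)              # new ID minted at the edge
    handle_request(log, "abc123")    # ID propagated from an upstream service
```

Because every line carries the same `cid=` field, logs from different services can be joined on it afterwards, which is what makes logs a workable substitute for a profiler in production.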
  • 36. Final remarks
       ● There is no standing still
       ● Learn to measure what matters and let everyone know
         ○ Avoid the “smart talk trap”, avoid debating abstractions
         ○ Data insights are the greatest sanitizer
       ● Make performance a whole team effort
         ○ Assign owners to all metrics
       ● Reward structural, quantifiable improvements
       ● Use the team currency
       ● Make the work easier by removing blockers
  • 37. Q&A ● Ivan Santa Maria Filho (ivansmf@google.com)
  • 38. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations/performance-manager-google-microsoft
