Code profiling gives a rich, detailed view of runtime performance, but it's difficult to achieve in production: profiling even a small fraction of web requests raises huge challenges in scalability, access, and ease of use. Despite this, Yelp profiles a nontrivial fraction of its traffic by combining Amazon EC2, Amazon EMR, and Amazon S3. Developers can search, sort, filter, and combine interesting profiles; during a site slowdown or page failure, this enables fast diagnosis and speedy recovery. Some of our analyses run nightly, while others run in real time via Storm topologies. This session covers our use cases for code profiling, its benefits, and the implementation of its handlers and analysis flows. We include both performance results and implementation challenges of our MapReduce and Storm jobs, including code overviews. We also touch on issues such as concurrent logging, cross-data-center replication, job scheduling, and API definitions.
(BDT402) Performance Profiling in Production: Analyzing Web Requests at Scale Using Amazon Elastic MapReduce and Storm | AWS re:Invent 2014
2. Roadmap
1. Why profile your code?
2. Create and analyze profiles
3. Acquire profiles from your webapp
4. Search and sort profiles
5. Aggregate similar profiles together
6. Search, sort, aggregate in real time
7. Future work, extensions, and possibilities
3. In a magical world, far far away…
• Our apps never break
• Our apps never slow down
• Developers think about scalability
• All external services run in O(1)
• All bugs are known a priori
4. In the world we live in…
• Accidents happen
• Developers are people
• Developers make mistakes
• Those mistakes can make it to… production!
7. One Fateful Day…
Crap crap crap
Your code makes me sad :(
Holy crap
Is this even our fault?
That’s like a 25% bump
Who did this???
Are we timing out?
Holy crap What’s the user impact?
9. Roadmap
1. Why profile your code?
2. Create and analyze profiles
3. Acquire profiles from your webapp
4. Search and sort profiles
5. Aggregate similar profiles together
6. Search, sort, aggregate in real time
7. Future work, extensions, and possibilities
10. Enter… the profiler…
• Generate deterministic statistics
– How many times is a method called?
– How long is that method’s runtime?
– What’s that method’s name/module?
– How much total runtime is devoted to it?
• It’s easy to use ad hoc:
– python -m cProfile myscript.py
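The ad hoc command above prints to stdout; with `-o` it instead writes a dump that the stdlib pstats module can load and sort. A minimal sketch (the `workload` function here is purely illustrative):

```python
import cProfile
import pstats

def workload():
    # stand-in for real application work
    return sum(i * i for i in range(10000))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# dump in the same format `python -m cProfile -o out.prof myscript.py` writes
profiler.dump_stats('out.prof')

# load the dump and print the five largest cumulative-time entries
stats = pstats.Stats('out.prof')
stats.sort_stats('cumulative').print_stats(5)
```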
12. Diff Based on Call Count (n~1,000)
ztm@dev7-devb:~$ diff_pstats -s calls several_months_ago.profile recently.profile
SORTING BY DELTA IN calls BEFORE AFTER DELTA
yelp/util/request_bucketer/bucketer.py:<lambda> 485 3284 2798
...site-packages/staticconf/proxy.py:method 1967 3524 1557
yelp/util/experiments.py:<genexpr> 231 1620 1389
...site-packages/simplejson/encoder.py:iterencode 0 1189 1189
yelp/core/encapsulation.py:__new__ 0 1062 1062
Diff Based on Cumulative Runtime (n~1,000)
ztm@dev7-devb:~$ diff_pstats -s cum several_months_ago.profile recently.profile
SORTING BY DELTA IN cum BEFORE AFTER DELTA
yelp/wsgi/tweens.py:tween 1.352045 5.487169 4.135124
yelp/web/gatekeeper/check.py:_handle 0.000000 1.378666 1.378666
yelp/web/emergency_captcha.py:_handle 0.000000 1.378226 1.378226
yelp/util/cheetah/filters.py:markup_filter 0.000000 0.101759 0.101759
yelp/logic/decorators.py:wrapper 0.233577 0.321657 0.088080
yelp/logic/experiments.py:experiments_for_yuv 0.034188 0.120480 0.086293
yelp/util/request_bucketer/bucketer.py:get_bucket 0.049993 0.135661 0.085668
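diff_pstats is an internal Yelp tool, but the comparison it performs only needs the raw stats dict that the stdlib pstats module exposes. A rough, assumed approximation of the cumulative-runtime diff shown above:

```python
import pstats

def load_stats(path):
    # pstats.Stats(...).stats maps (filename, line, funcname) ->
    # (call count, primitive calls, tottime, cumtime, callers)
    return pstats.Stats(path).stats

def diff_cumulative(before, after, top=5):
    """Sort every function by its change in cumulative runtime,
    roughly what a diff_pstats-style tool does under the hood."""
    deltas = []
    for func in set(before) | set(after):
        cum_before = before.get(func, (0, 0, 0.0, 0.0, {}))[3]
        cum_after = after.get(func, (0, 0, 0.0, 0.0, {}))[3]
        deltas.append((cum_after - cum_before, func))
    return sorted(deltas, reverse=True)[:top]
```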
13. BUT HOW DOES THIS WORK IN PRODUCTION?!?!
Hi! I’m Daurius, the profiling hedgehog!
14. Roadmap
1. Why profile your code?
2. Create and analyze profiles
3. Acquire profiles from your webapp
4. Search and sort profiles
5. Aggregate similar profiles together
6. Search, sort, aggregate in real time
7. Future work, extensions, and possibilities
15. Get Your Data!
• Make a context manager
– Wrap your app in it at a high level
– Return a profiling context… sometimes
• Make a place to put your profiles
– We use a distributed logging system, Scribe
– You can also save them to a local disk
– As long as they eventually go to the cloud!
• Add your logging stream to your profiles!
– Then you can search for attributes
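The steps above can be sketched as a context manager that runs the wrapped request under cProfile and hands the marshaled stats to whatever logging sink you use. This is a hypothetical illustration, not Yelp's implementation; `log_fn` stands in for a Scribe client or local file writer:

```python
import cProfile
import marshal
import pstats

class ProfilingContext(object):
    """Hypothetical sketch: profile the wrapped block, then pass the
    marshaled stats dict to log_fn (e.g. a Scribe client)."""

    def __init__(self, log_fn):
        self.log_fn = log_fn
        self.profiler = cProfile.Profile()

    def __enter__(self):
        self.profiler.enable()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.profiler.disable()
        stats = pstats.Stats(self.profiler)
        # marshal round-trips the raw stats dict, so a consumer can
        # rebuild a pstats.Stats object from the logged bytes later
        self.log_fn(marshal.dumps(stats.stats))
        return False  # never swallow request exceptions
```

Usage: `with ProfilingContext(scribe_log): handle_request()`.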
16. System Diagram
End-user requests travel over the Internet to Yelp DCs (East Coast) and Yelp DCs (West Coast).
Each data center runs a Scribe aggregator, which uploads Scribe logs to Amazon S3.
All your profiling and logging data in one place! From S3: real-time analysis and log tailing.
20. Webapp Context Manager
class CProfileContextManager(object):

    def should_profile(self, servlet):
        """ Return True with the configured probability for this servlet. """
        cprof_prob = get_config(servlet, DEFAULT)
        return random.random() < cprof_prob

    def get_manager(self, request):
        """ Return a context manager for the request. """
        if config.enabled and self.should_profile(request.servlet):
            return CProfileScribeContext()
        return CProfileNoOp()
23. Per-Servlet Configuration
• Consider maintaining a config!
– Default percentage of requests to profile
– Override for specific servlets
• Useful for unusual/rarely loaded flows
• Reload dynamically with PyStaticConf
cprofile:
enabled: True
probability:
default: 0.000X
servlets:
- home: 0.002X
- biz_details: 0.001X
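Resolving the sampling probability from such a config is a simple lookup with a fallback. A sketch under assumptions: the values below are made up, the YAML is shown already parsed into a dict, and loading/hot-reloading is left to a library such as staticconf:

```python
import random

# parsed form of a config like the YAML above (values are invented)
CONFIG = {
    'enabled': True,
    'probability': {
        'default': 0.0005,
        'servlets': {'home': 0.002, 'biz_details': 0.001},
    },
}

def profile_probability(servlet, config=CONFIG):
    """Per-servlet override if present, else the default rate."""
    prob = config['probability']
    return prob['servlets'].get(servlet, prob['default'])

def should_profile(servlet, config=CONFIG):
    # flip a biased coin per request
    return config['enabled'] and random.random() < profile_probability(servlet, config)
```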
24. Roadmap
1. Why profile your code?
2. Create and analyze profiles
3. Acquire profiles from your webapp
4. Search and sort profiles
5. Aggregate similar profiles together
6. Search, sort, aggregate in real time
7. Future work, extensions, and possibilities
25. Usability is KEY!
• Having ~150,000 of anything per day is HARD!
• You need to be able to search, sort, and filter
• You need to do this quickly
– Or it gets stale (less than one-day latency)
– Stale data isn’t (usually) useful!
I’m a classy ‘hog… I wanna be FRESH!
27. Why Amazon EMR?
• EMR lets you run MapReduce jobs in the cloud
– How big a cluster? As big as you want!
• EMR spins up on demand, too
• It’s super easy to use with Python!
– Yelp maintains MRJob
Mr. Job and I are best buddies!
28. Save Discrete Profile, Logging Files
• Process lines of Scribe logs into correct formats
– Perfectly parallel – each line is independent!
• Save into Amazon S3
– One file for each request’s cProfile
– One file for each request’s logging data
• Analyze logging data for searchable parameters
– Each parameter can be computed in parallel!
29. Parameters Yelp Cares About
• WHO: Is the user logged in?
• WHAT: Which page did the user access?
• site (main, mobile, api, biz site)
• servlet (home, biz_details, user_profile)
• action (submit, first load, refresh)
• WHERE: Which data center?
• WHEN: 2014-10-01 T 13:10:53
• HOW: HTTP request (GET, POST, PUT)
• HOW LONG: over/under 1 second response
30. Save Discrete Profile, Logging Files
class MRScribeTagCprofile(MRJob):

    def mapper(self, _, line):
        # convert text into dict; convert JSON to a pstats object
        request = process_ranger_line(line)
        pstats = decode_stats(request['cprofile'])

        # ex: logs/cprofile-discrete/2014/10/01/00:01:34-3fc2d016d8accaf4
        save_path = get_basekeyname(request)

        # save pstats and logging info to Amazon S3
        bucket.set_object_gz(save_path + '.profile.gz',
                             marshal.dumps(pstats.stats))
        bucket.set_object_gz(save_path + '.ranger.gz',
                             write_ranger_line(request['ranger']))

        # key examples: datacenter/sfo ; loggedin/True ; servlet/home
        for key in make_all_matching_tags(request):
            yield key, save_path
35. Save Discrete Profile, Logging Files
class MRScribeTagCprofile(MRJob):

    def reducer(self, tag_key, matching_paths):
        # ex: logs/cprofile-discrete/2014/10/01/tags/datacenter/sfo
        tag_path = tag_path_for(tag_key)
        new_paths = list(matching_paths)

        # get old list of matching values; add new values
        tag_list = bucket.get_object_gz(tag_path).split('\n')
        # update matching values
        tag_list.extend(new_paths)
        tag_contents = '\n'.join(tag_list)
        # upload tag file w/ new matching web requests
        bucket.set_object_gz(tag_path, tag_contents)
        # output the number of paths per tag we added
        yield tag_key, len(new_paths)
40. Roadmap
1. Why profile your code?
2. Create and analyze profiles
3. Acquire profiles from your webapp
4. Search and sort profiles
5. Aggregate similar profiles together
6. Search, sort, aggregate in real time
7. Future work, extensions, and possibilities
41. Aggregate into Multiple Requests
• Any single profile doesn’t tell the whole story
– If you pick one at random…
– There’s no guarantee it’ll show the badness
• Create aggregate profiles
– Usually one per day, for each set of parameters
– Compare daily aggregates to see the big picture
It’s hard to see the hedgehogs for the trees!
42. Aggregate into Multiple Requests
class MRCprofileCombine(MRJob):

    def mapper(self, _, pathname):
        # download the logging info and process it
        ranger_raw = bucket.get_object_gz(pathname + '.ranger.gz')
        ranger_data = process_ranger_line(ranger_raw)

        # download the cProfile info and process it
        profile_raw = bucket.get_object_gz(pathname + '.profile.gz')
        stats = pstats.Stats(marshal.loads(profile_raw))

        # key examples: datacenter/sfo ; loggedin/True ; servlet/home
        tags = make_all_matching_tags(ranger_data)
        # generate all 7-ary, ... 1-ary, 0-ary matching paths
        # 3-ary example: http_method.GET,servlet.biz_details,site.main
        for path in batch_process_paths(tags):
            yield path, {'ranger': [ranger_data],
                         'cprofile': encode_stats(stats)}
47. Aggregate into Multiple Requests
• Generating all batch paths is messy
– First version looked like this…
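One tidier way to generate every 0-ary through 7-ary combination is itertools.combinations over the sorted tags. A sketch of what a batch_process_paths-style helper could do (the function name and joining format are assumptions for illustration):

```python
from itertools import combinations

def all_tag_paths(tags):
    """Yield one comma-joined path per subset of the request's tags,
    from the full n-ary combination down to the empty 0-ary one."""
    tags = sorted(tags)  # canonical order so equal subsets collide
    for r in range(len(tags), -1, -1):
        for combo in combinations(tags, r):
            yield ','.join(combo)
```

For two tags this yields the pair, each single tag, and the empty (match-everything) path.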
49. Aggregate into Multiple Requests
class MRCprofileCombine(MRJob):

    def reducer(self, path_key, entries):
        combo_pstats = None
        combo_ranger = []
        # Loop over every set of profiles (1 or more) given
        for entry in entries:
            # Add cProfile data together
            if combo_pstats:
                combo_pstats.add(decode_stats(entry['cprofile']))
            else:
                combo_pstats = decode_stats(entry['cprofile'])
            # Add logging data together
            combo_ranger.append(entry['ranger'])

        # See next slide
53. Aggregate into Multiple Requests
class MRCprofileCombine(MRJob):

    def reducer(self, path_key, entries):
        # See previous slide

        # ex: data/cprofile-processed/batch/2014/10/01/
        #     date.2014-10-01,http_method.GET,servlet.biz_details,site.main
        pathname = batch_path(combo_ranger)
        # save combined cprofile and logging data
        bucket.set_object_gz(pathname + '.profile.gz',
                             marshal.dumps(combo_pstats.stats))
        bucket.set_object_gz(pathname + '.ranger.gz',
                             write_ranger_line(combo_ranger))
        yield pathname, len(combo_ranger)
56. System Diagram Redux
Web request → “Profile Me Maybe?” → Scribe → Amazon S3.
Nightly MRJob on EMR: upload and tag → Amazon S3: discrete profiles, logs; per-attribute tags.
Nightly MRJob on EMR: aggregate records → Amazon S3: combined profiles, logs; per-attribute tags.
Ad hoc MRJob on EMR: N-day aggregate → e-mail notify.
Profilistic service. Hi, Daurius!
57. Aggregate into Multiple Requests
• We have, for every possible combination:
• A combined set of profile statistics
• A combined set of logging data
• Ex: examining {servlet: biz_details}
• user logged in; long run!
• DC: east; user logged in; long run!
• DC: east; HTTP: POST; user logged in; long run!
• DC: east; HTTP: POST; site: main; logged in; long run
58. Diff Based on Call Count (n~1,000)
ztm@dev7-devb:~$ diff_pstats -s calls several_months_ago.profile recently.profile
SORTING BY DELTA IN calls BEFORE AFTER DELTA
yelp/util/request_bucketer/bucketer.py:<lambda> 485 3284 2798
...site-packages/staticconf/proxy.py:method 1967 3524 1557
yelp/util/experiments.py:<genexpr> 231 1620 1389
...site-packages/simplejson/encoder.py:iterencode 0 1189 1189
yelp/core/encapsulation.py:__new__ 0 1062 1062
Diff Based on Cumulative Runtime (n~1,000)
ztm@dev7-devb:~$ diff_pstats -s cum several_months_ago.profile recently.profile
SORTING BY DELTA IN cum BEFORE AFTER DELTA
yelp/wsgi/tweens.py:tween 1.352045 5.487169 4.135124
yelp/web/gatekeeper/check.py:_handle 0.000000 1.378666 1.378666
yelp/web/emergency_captcha.py:_handle 0.000000 1.378226 1.378226
yelp/util/cheetah/filters.py:markup_filter 0.000000 0.101759 0.101759
yelp/logic/decorators.py:wrapper 0.233577 0.321657 0.088080
yelp/logic/experiments.py:experiments_for_yuv 0.034188 0.120480 0.086293
yelp/util/request_bucketer/bucketer.py:get_bucket 0.049993 0.135661 0.085668
59. Storage Considerations, Per Day
• 152,814 discrete profile/log records
• 40,537 aggregate combinations (0-ary to 7-ary)
!
• 386,707 total files created
!
• 62.25 GB storage space used (all gzipped)
• 40.99 GB on aggregate profiles (without logs)
• 21.28 GB on individual profiles/logs
61. Performance Considerations
• Amazon Elastic MapReduce works best when all units of work take equal time
• This is not the case for our aggregations!
– 60% combine 10 or fewer profiles
– 95% combine 1,000 or fewer profiles
– 8 combine over 100,000 profiles
62. Remember Ease of Use?
…Remember Daurius?
Ooh! It’s my time to shine!
67. Roadmap
1. Why profile your code?
2. Create and analyze profiles
3. Acquire profiles from your webapp
4. Search and sort profiles
5. Aggregate similar profiles together
6. Search, sort, aggregate in real time
7. Future work, extensions, and possibilities
68. Enter… the Storm!
• Apache Storm
• Real-time distributed computation platform
• A topology: directed graph of processing steps, passing tuples
• Spouts - sources of data - like Scribe!
• Bolts - processors of data - like MRJob!
• Groupings - define how tuples move between…
69. Pyleus: A Python Framework for Storm Topologies
• Pyleus: Yelp’s super new Python Storm bindings
• Now open sourced! http://pyleus.org
• Build topologies in Python
• Declaratively describe structure in YAML
• Respects requirements.txt
• Compose a topology from Python packaged components!
74. Sample Pyleus Topology
- bolt:  # equivalent of first mapper
    name: process-ranger
    module: profilistic.storm.process_ranger
    groupings:
      - shuffle_grouping: cprofile-sfo
      - shuffle_grouping: cprofile-iad

- bolt:  # equiv. of first reducer, plus an S3 cache
    name: update-tag
    module: profilistic.storm.update_tag
    tasks: 6
    parallelism_hint: 3
    groupings:
      - fields_grouping:
          component: process-ranger
          fields:
            - tag
77. Sample Pyleus Bolt
class MyFirstBolt(pyleus.storm.SimpleBolt):

    def initialize(self):
        # set up any persistent config resources
        staticconf.YamlConfiguration( ... )
        self.bucket = s3.get_bucket( ... )

    def process_tuple(self, tup):
        key, value = tup
        # do stuff here!

        new_tup = (new_key, new_value)
        self.emit(new_tup)

if __name__ == '__main__':
    MyFirstBolt.run()
81. Profilistic in Pyleus
• Profiles used to be one day delayed
• Or, in emergencies, an ad hoc midday batch run
• Now, ~10 minutes after bad performance…
!
• You can investigate!
82. Roadmap
1. Why profile your code?
2. Create and analyze profiles
3. Acquire profiles from your webapp
4. Search and sort profiles
5. Aggregate similar profiles together
6. Search, sort, aggregate in real time
7. Future work, extensions, and possibilities
83. Future Work
1. Active monitoring
• For every new aggregation created each day
• Pull the same aggregation from 1 day, 1 week ago
• DIFF them!
• If the delta is too big, send an alert or an e-mail
• Easy add-on to end of Pyleus topology
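The steps above can be sketched as a pure comparison over two pstats-style stats dicts. This is a hypothetical illustration of the proposed monitoring check, not existing Yelp code; the function name and the seconds-based threshold are assumptions:

```python
def functions_to_alert(yesterday, today, threshold=1.0):
    """Flag any function whose cumulative runtime grew by more than
    `threshold` seconds between two aggregate profiles. Inputs are
    pstats-style dicts: (file, line, name) ->
    (call count, primitive calls, tottime, cumtime, callers)."""
    alerts = []
    for func, entry in today.items():
        cum_today = entry[3]
        cum_yesterday = yesterday.get(func, (0, 0, 0.0, 0.0, {}))[3]
        if cum_today - cum_yesterday > threshold:
            # a real deployment would send an alert or e-mail here
            alerts.append((func, cum_today - cum_yesterday))
    return alerts
```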
84. Future Work
2. Visualization within the webapp
• Already possible ad hoc: graphviz files to PDF
• Most recent Yelp hackathon (F’14): Someone built this!
85. How do I… DIY?
1. Wrap the webapp in a context manager
2. Save profiles into the cloud
3. Tag profiles with attributes
4. Combine profiles based on attributes
5. Build a quick-’n-dirty internal app to search/filter
6. Refactor it all into Storm?
7. Give the hedgehog a hug!
I believe in you! ♥
86. Yelp Dataset Challenge
Academic dataset from Phoenix, Las Vegas, Madison, Waterloo, and Edinburgh!
● 1,125,458 reviews
● 42,153 businesses
○ 320,002 business attributes
● 403,210 tips
● 252,898 users
○ 955,999-edge social graph
● 31,617 check-in sets
+
Your academic project, research, and/or visualizations submitted by December 31, 2014
=
$5,000 prize + $1,000 for publication + $500 for presenting*
yelp.com/dataset_challenge
*See full terms on website
87. Thanks for listening!
Don’t be a stranger! ztm@yelp.com
Python MapReduce package: http://mrjob.org
Python Storm package: http://pyleus.org