Code profiling gives a rich, detailed view of runtime performance, but it's difficult to achieve in production: profiling even a small fraction of web requests raises huge challenges in scalability, access, and ease of use. Despite this, Yelp profiles a nontrivial fraction of its traffic by combining Amazon EC2, Amazon EMR, and Amazon S3. Developers can search, sort, filter, and combine interesting profiles; during a site slowdown or page failure, this enables fast diagnosis and speedy recovery. Some of our analyses run nightly, while others run in real time via Storm topologies. This session covers our use cases for code profiling, its benefits, and the implementation of its handlers and analysis flows. We include both performance results and implementation challenges of our MapReduce and Storm jobs, including code overviews. We also touch on issues such as concurrent logging, cross-data-center replication, job scheduling, and API definitions.
(BDT402) Performance Profiling in Production: Analyzing Web Requests at Scale Using Amazon Elastic MapReduce and Storm | AWS re:Invent 2014
2. Roadmap
1. Why profile your code?
2. Create and analyze profiles
3. Acquire profiles from your webapp
4. Search and sort profiles
5. Aggregate similar profiles together
6. Search, sort, aggregate in real time
7. Future work, extensions, and possibilities
3. In a magical world, far far away…
• Our apps never break
• Our apps never slow down
• Developers think about scalability
• All external services run in O(1)
• All bugs are known a priori
4. In the world we live in…
• Accidents happen
• Developers are people
• Developers make mistakes
• Those mistakes can make it to… production!
7. One Fateful Day…
Crap crap crap
Your code makes me sad :(
Holy crap
Is this even our fault?
That’s like a 25% bump
Who did this???
Are we timing out?
Holy crap What’s the user impact?
9. Roadmap
1. Why profile your code?
2. Create and analyze profiles
3. Acquire profiles from your webapp
4. Search and sort profiles
5. Aggregate similar profiles together
6. Search, sort, aggregate in real time
7. Future work, extensions, and possibilities
10. Enter… the profiler…
• Generate deterministic statistics
– How many times is a method called?
– How long is that method’s runtime?
– What’s that method’s name/module?
– How much total runtime is devoted to it?
• It’s easy to use ad hoc:
– python -m cProfile myscript.py
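The ad hoc command above prints to stdout; with `-o` it instead writes a dump that the stdlib pstats module can load and sort. A minimal sketch (the `workload` function here is purely illustrative):

```python
import cProfile
import pstats

def workload():
    # stand-in for real application work
    return sum(i * i for i in range(10000))

profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# dump in the same format `python -m cProfile -o out.prof myscript.py` writes
profiler.dump_stats('out.prof')

# load the dump and print the five largest cumulative-time entries
stats = pstats.Stats('out.prof')
stats.sort_stats('cumulative').print_stats(5)
```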
12. Diff Based on Call Count (n~1,000)
ztm@dev7-devb:~$ diff_pstats -s calls several_months_ago.profile recently.profile
SORTING BY DELTA IN calls BEFORE AFTER DELTA
yelp/util/request_bucketer/bucketer.py:<lambda> 485 3284 2798
...site-packages/staticconf/proxy.py:method 1967 3524 1557
yelp/util/experiments.py:<genexpr> 231 1620 1389
...site-packages/simplejson/encoder.py:iterencode 0 1189 1189
yelp/core/encapsulation.py:__new__ 0 1062 1062
Diff Based on Cumulative Runtime (n~1,000)
ztm@dev7-devb:~$ diff_pstats -s cum several_months_ago.profile recently.profile
SORTING BY DELTA IN cum BEFORE AFTER DELTA
yelp/wsgi/tweens.py:tween 1.352045 5.487169 4.135124
yelp/web/gatekeeper/check.py:_handle 0.000000 1.378666 1.378666
yelp/web/emergency_captcha.py:_handle 0.000000 1.378226 1.378226
yelp/util/cheetah/filters.py:markup_filter 0.000000 0.101759 0.101759
yelp/logic/decorators.py:wrapper 0.233577 0.321657 0.088080
yelp/logic/experiments.py:experiments_for_yuv 0.034188 0.120480 0.086293
yelp/util/request_bucketer/bucketer.py:get_bucket 0.049993 0.135661 0.085668
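diff_pstats is an internal Yelp tool, but the comparison it performs only needs the raw stats dict that the stdlib pstats module exposes. A rough, assumed approximation of the cumulative-runtime diff shown above:

```python
import pstats

def load_stats(path):
    # pstats.Stats(...).stats maps (filename, line, funcname) ->
    # (call count, primitive calls, tottime, cumtime, callers)
    return pstats.Stats(path).stats

def diff_cumulative(before, after, top=5):
    """Sort every function by its change in cumulative runtime,
    roughly what a diff_pstats-style tool does under the hood."""
    deltas = []
    for func in set(before) | set(after):
        cum_before = before.get(func, (0, 0, 0.0, 0.0, {}))[3]
        cum_after = after.get(func, (0, 0, 0.0, 0.0, {}))[3]
        deltas.append((cum_after - cum_before, func))
    return sorted(deltas, reverse=True)[:top]
```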
13. BUT HOW DOES THIS WORK IN PRODUCTION?!?!
Hi! I’m Daurius, the profiling hedgehog!
14. Roadmap
1. Why profile your code?
2. Create and analyze profiles
3. Acquire profiles from your webapp
4. Search and sort profiles
5. Aggregate similar profiles together
6. Search, sort, aggregate in real time
7. Future work, extensions, and possibilities
15. Get Your Data!
• Make a context manager
– Wrap your app in it at a high level
– Return a profiling context… sometimes
• Make a place to put your profiles
– We use a distributed logging system, Scribe
– You can also save them to a local disk
– As long as they eventually go to the cloud!
• Add your logging stream to your profiles!
– Then you can search for attributes
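The steps above can be sketched as a context manager that runs the wrapped request under cProfile and hands the marshaled stats to whatever logging sink you use. This is a hypothetical illustration, not Yelp's implementation; `log_fn` stands in for a Scribe client or local file writer:

```python
import cProfile
import marshal
import pstats

class ProfilingContext(object):
    """Hypothetical sketch: profile the wrapped block, then pass the
    marshaled stats dict to log_fn (e.g. a Scribe client)."""

    def __init__(self, log_fn):
        self.log_fn = log_fn
        self.profiler = cProfile.Profile()

    def __enter__(self):
        self.profiler.enable()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.profiler.disable()
        stats = pstats.Stats(self.profiler)
        # marshal round-trips the raw stats dict, so a consumer can
        # rebuild a pstats.Stats object from the logged bytes later
        self.log_fn(marshal.dumps(stats.stats))
        return False  # never swallow request exceptions
```

Usage: `with ProfilingContext(scribe_log): handle_request()`.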
16. System Diagram
End-user requests travel over the Internet to Yelp DCs (East Coast) and Yelp DCs (West Coast).
Each data center runs a Scribe aggregator, which uploads Scribe logs to Amazon S3.
All your profiling and logging data in one place! From S3: real-time analysis and log tailing.
20. Webapp Context Manager
class CProfileContextManager(object):

    def should_profile(self, servlet):
        """ Return True with the configured probability for this servlet. """
        cprof_prob = get_config(servlet, DEFAULT)
        return random.random() < cprof_prob

    def get_manager(self, request):
        """ Return a context manager for the request. """
        if config.enabled and self.should_profile(request.servlet):
            return CProfileScribeContext()
        return CProfileNoOp()
23. Per-Servlet Configuration
• Consider maintaining a config!
– Default percentage of requests to profile
– Override for specific servlets
• Useful for unusual/rarely loaded flows
• Reload dynamically with PyStaticConf
cprofile:
enabled: True
probability:
default: 0.000X
servlets:
- home: 0.002X
- biz_details: 0.001X
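Resolving the sampling probability from such a config is a simple lookup with a fallback. A sketch under assumptions: the values below are made up, the YAML is shown already parsed into a dict, and loading/hot-reloading is left to a library such as staticconf:

```python
import random

# parsed form of a config like the YAML above (values are invented)
CONFIG = {
    'enabled': True,
    'probability': {
        'default': 0.0005,
        'servlets': {'home': 0.002, 'biz_details': 0.001},
    },
}

def profile_probability(servlet, config=CONFIG):
    """Per-servlet override if present, else the default rate."""
    prob = config['probability']
    return prob['servlets'].get(servlet, prob['default'])

def should_profile(servlet, config=CONFIG):
    # flip a biased coin per request
    return config['enabled'] and random.random() < profile_probability(servlet, config)
```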
24. Roadmap
1. Why profile your code?
2. Create and analyze profiles
3. Acquire profiles from your webapp
4. Search and sort profiles
5. Aggregate similar profiles together
6. Search, sort, aggregate in real time
7. Future work, extensions, and possibilities
25. Usability is KEY!
• Having ~150,000 of anything per day is HARD!
• You need to be able to search, sort, and filter
• You need to do this quickly
– Or it gets stale (less than one-day latency)
– Stale data isn’t (usually) useful!
I’m a classy ‘hog… I wanna be FRESH!
27. Why Amazon EMR?
• EMR lets you run MapReduce jobs in the cloud
– How big a cluster? As big as you want!
• EMR spins up on demand, too
• It’s super easy to use with Python!
– Yelp maintains MRJob
Mr. Job and I are best buddies!
28. Save Discrete Profile, Logging Files
• Process lines of Scribe logs into correct formats
– Perfectly parallel – each line is independent!
• Save into Amazon S3
– One file for each request’s cProfile
– One file for each request’s logging data
• Analyze logging data for searchable parameters
– Each parameter can be computed in parallel!
29. Parameters Yelp Cares About
• WHO: Is the user logged in?
• WHAT: Which page did the user access?
• site (main, mobile, api, biz site)
• servlet (home, biz_details, user_profile)
• action (submit, first load, refresh)
• WHERE: Which data center?
• WHEN: 2014-10-01 T 13:10:53
• HOW: HTTP request (GET, POST, PUT)
• HOW LONG: over/under 1 second response
30. Save Discrete Profile, Logging Files
class MRScribeTagCprofile(MRJob):

    def mapper(self, _, line):
        # convert text into dict; convert JSON to a pstats object
        request = process_ranger_line(line)
        pstats = decode_stats(request['cprofile'])

        # ex: logs/cprofile-discrete/2014/10/01/00:01:34-3fc2d016d8accaf4
        save_path = get_basekeyname(request)

        # save pstats and logging info to Amazon S3
        bucket.set_object_gz(save_path + '.profile.gz',
                             marshal.dumps(pstats.stats))
        bucket.set_object_gz(save_path + '.ranger.gz',
                             write_ranger_line(request['ranger']))

        # key examples: datacenter/sfo ; loggedin/True ; servlet/home
        for key in make_all_matching_tags(request):
            yield key, save_path
35. Save Discrete Profile, Logging Files
class MRScribeTagCprofile(MRJob):

    def reducer(self, tag_key, matching_paths):
        # ex: logs/cprofile-discrete/2014/10/01/tags/datacenter/sfo
        tag_path = tag_path_for(tag_key)
        new_paths = list(matching_paths)

        # get old list of matching values; add new values
        tag_list = bucket.get_object_gz(tag_path).split('\n')
        # update matching values
        tag_list.extend(new_paths)
        tag_contents = '\n'.join(tag_list)
        # upload tag file w/ new matching web requests
        bucket.set_object_gz(tag_path, tag_contents)
        # output the number of paths per tag we added
        yield tag_key, len(new_paths)
40. Roadmap
1. Why profile your code?
2. Create and analyze profiles
3. Acquire profiles from your webapp
4. Search and sort profiles
5. Aggregate similar profiles together
6. Search, sort, aggregate in real time
7. Future work, extensions, and possibilities
41. Aggregate into Multiple Requests
• Any single profile doesn’t tell the whole story
– If you pick one at random…
– There’s no guarantee it’ll show the badness
• Create aggregate profiles
– Usually one per day, for each set of parameters
– Compare daily aggregates to see the big picture
It’s hard to see the hedgehogs for the trees!
42. Aggregate into Multiple Requests
class MRCprofileCombine(MRJob):

    def mapper(self, _, pathname):
        # download the logging info and process it
        ranger_raw = bucket.get_object_gz(pathname + '.ranger.gz')
        ranger_data = process_ranger_line(ranger_raw)

        # download the cProfile info and process it
        profile_raw = bucket.get_object_gz(pathname + '.profile.gz')
        stats = pstats.Stats(marshal.loads(profile_raw))

        # key examples: datacenter/sfo ; loggedin/True ; servlet/home
        tags = make_all_matching_tags(ranger_data)
        # generate all 7-ary, ... 1-ary, 0-ary matching paths
        # 3-ary example: http_method.GET,servlet.biz_details,site.main
        for path in batch_process_paths(tags):
            yield path, {'ranger': [ranger_data],
                         'cprofile': encode_stats(stats)}
47. Aggregate into Multiple Requests
• Generating all batch paths is messy
– First version looked like this…
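One tidier way to generate every 0-ary through 7-ary combination is itertools.combinations over the sorted tags. A sketch of what a batch_process_paths-style helper could do (the function name and joining format are assumptions for illustration):

```python
from itertools import combinations

def all_tag_paths(tags):
    """Yield one comma-joined path per subset of the request's tags,
    from the full n-ary combination down to the empty 0-ary one."""
    tags = sorted(tags)  # canonical order so equal subsets collide
    for r in range(len(tags), -1, -1):
        for combo in combinations(tags, r):
            yield ','.join(combo)
```

For two tags this yields the pair, each single tag, and the empty (match-everything) path.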
49. Aggregate into Multiple Requests
class MRCprofileCombine(MRJob):

    def reducer(self, path_key, entries):
        combo_pstats = None
        combo_ranger = []
        # Loop over every set of profiles (1 or more) given
        for entry in entries:
            # Add cProfile data together
            if combo_pstats:
                combo_pstats.add(decode_stats(entry['cprofile']))
            else:
                combo_pstats = decode_stats(entry['cprofile'])
            # Add logging data together
            combo_ranger.append(entry['ranger'])

        # See next slide
53. Aggregate into Multiple Requests
class MRCprofileCombine(MRJob):

    def reducer(self, path_key, entries):
        # See previous slide

        # ex: data/cprofile-processed/batch/2014/10/01/
        #     date.2014-10-01,http_method.GET,servlet.biz_details,site.main
        pathname = batch_path(combo_ranger)
        # save combined cprofile and logging data
        bucket.set_object_gz(pathname + '.profile.gz',
                             marshal.dumps(combo_pstats.stats))
        bucket.set_object_gz(pathname + '.ranger.gz',
                             write_ranger_line(combo_ranger))
        yield pathname, len(combo_ranger)
56. System Diagram Redux
Web request → “Profile Me Maybe?” → Scribe → Amazon S3.
Nightly MRJob on EMR: upload and tag → Amazon S3: discrete profiles, logs; per-attribute tags.
Nightly MRJob on EMR: aggregate records → Amazon S3: combined profiles, logs; per-attribute tags.
Ad hoc MRJob on EMR: N-day aggregate → e-mail notify.
Profilistic service. Hi, Daurius!
57. Aggregate into Multiple Requests
• We have, for every possible combination:
• A combined set of profile statistics
• A combined set of logging data
• Ex: examining {servlet: biz_details}
• user logged in; long run!
• DC: east; user logged in; long run!
• DC: east; HTTP: POST; user logged in; long run!
• DC: east; HTTP: POST; site: main; logged in; long run
58. Diff Based on Call Count (n~1,000)
ztm@dev7-devb:~$ diff_pstats -s calls several_months_ago.profile recently.profile
SORTING BY DELTA IN calls BEFORE AFTER DELTA
yelp/util/request_bucketer/bucketer.py:<lambda> 485 3284 2798
...site-packages/staticconf/proxy.py:method 1967 3524 1557
yelp/util/experiments.py:<genexpr> 231 1620 1389
...site-packages/simplejson/encoder.py:iterencode 0 1189 1189
yelp/core/encapsulation.py:__new__ 0 1062 1062
Diff Based on Cumulative Runtime (n~1,000)
ztm@dev7-devb:~$ diff_pstats -s cum several_months_ago.profile recently.profile
SORTING BY DELTA IN cum BEFORE AFTER DELTA
yelp/wsgi/tweens.py:tween 1.352045 5.487169 4.135124
yelp/web/gatekeeper/check.py:_handle 0.000000 1.378666 1.378666
yelp/web/emergency_captcha.py:_handle 0.000000 1.378226 1.378226
yelp/util/cheetah/filters.py:markup_filter 0.000000 0.101759 0.101759
yelp/logic/decorators.py:wrapper 0.233577 0.321657 0.088080
yelp/logic/experiments.py:experiments_for_yuv 0.034188 0.120480 0.086293
yelp/util/request_bucketer/bucketer.py:get_bucket 0.049993 0.135661 0.085668
59. Storage Considerations, Per Day
• 152,814 discrete profile/log records
• 40,537 aggregate combinations (0-ary to 7-ary)
!
• 386,707 total files created
!
• 62.25 GB storage space used (all gzipped)
• 40.99 GB on aggregate profiles (without logs)
• 21.28 GB on individual profiles/logs
61. Performance Considerations
• Amazon Elastic MapReduce works best when all units of work take equal time
• This is not the case for our aggregations!
– 60% combine 10 or fewer profiles
– 95% combine 1,000 or fewer profiles
– 8 combine over 100,000 profiles
62. Remember Ease of Use?
…Remember Daurius?
Ooh! It’s my time to shine!
67. Roadmap
1. Why profile your code?
2. Create and analyze profiles
3. Acquire profiles from your webapp
4. Search and sort profiles
5. Aggregate similar profiles together
6. Search, sort, aggregate in real time
7. Future work, extensions, and possibilities
68. Enter… the Storm!
• Apache Storm
• Real-time distributed computation platform
• A topology: directed graph of processing steps, passing tuples
• Spouts - sources of data - like Scribe!
• Bolts - processors of data - like MRJob!
• Groupings - define how tuples move between…
69. Pyleus: A Python Framework for Storm Topologies
• Pyleus: Yelp’s super new Python Storm bindings
• Now open sourced! http://pyleus.org
• Build topologies in Python
• Declaratively describe structure in YAML
• Respects requirements.txt
• Compose a topology from Python packaged components!
74. Sample Pyleus Topology
- bolt:  # equivalent of first mapper
    name: process-ranger
    module: profilistic.storm.process_ranger
    groupings:
      - shuffle_grouping: cprofile-sfo
      - shuffle_grouping: cprofile-iad

- bolt:  # equiv. of first reducer, plus an S3 cache
    name: update-tag
    module: profilistic.storm.update_tag
    tasks: 6
    parallelism_hint: 3
    groupings:
      - fields_grouping:
          component: process-ranger
          fields:
            - tag
77. Sample Pyleus Bolt
class MyFirstBolt(pyleus.storm.SimpleBolt):

    def initialize(self):
        # set up any persistent config resources
        staticconf.YamlConfiguration( ... )
        self.bucket = s3.get_bucket( ... )

    def process_tuple(self, tup):
        key, value = tup
        # do stuff here!

        new_tup = (new_key, new_value)
        self.emit(new_tup)

if __name__ == '__main__':
    MyFirstBolt.run()
81. Profilistic in Pyleus
• Profiles used to be one day delayed
• Or, in emergencies, an ad hoc midday batch run
• Now, ~10 minutes after bad performance…
!
• You can investigate!
82. Roadmap
1. Why profile your code?
2. Create and analyze profiles
3. Acquire profiles from your webapp
4. Search and sort profiles
5. Aggregate similar profiles together
6. Search, sort, aggregate in real time
7. Future work, extensions, and possibilities
83. Future Work
1. Active monitoring
• For every new aggregation created each day
• Pull the same aggregation from 1 day, 1 week ago
• DIFF them!
• If the delta is too big, send an alert or an e-mail
• Easy add-on to end of Pyleus topology
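The steps above can be sketched as a pure comparison over two pstats-style stats dicts. This is a hypothetical illustration of the proposed monitoring check, not existing Yelp code; the function name and the seconds-based threshold are assumptions:

```python
def functions_to_alert(yesterday, today, threshold=1.0):
    """Flag any function whose cumulative runtime grew by more than
    `threshold` seconds between two aggregate profiles. Inputs are
    pstats-style dicts: (file, line, name) ->
    (call count, primitive calls, tottime, cumtime, callers)."""
    alerts = []
    for func, entry in today.items():
        cum_today = entry[3]
        cum_yesterday = yesterday.get(func, (0, 0, 0.0, 0.0, {}))[3]
        if cum_today - cum_yesterday > threshold:
            # a real deployment would send an alert or e-mail here
            alerts.append((func, cum_today - cum_yesterday))
    return alerts
```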
84. Future Work
2. Visualization within the webapp
• Already possible ad hoc: graphviz files to PDF
• Most recent Yelp hackathon (F’14): Someone built this!
85. How do I… DIY?
1. Wrap the webapp in a context manager
2. Save profiles into the cloud
3. Tag profiles with attributes
4. Combine profiles based on attributes
5. Build a quick-’n-dirty internal app to search/filter
6. Refactor it all into Storm?
7. Give the hedgehog a hug!
I believe in you! ♥
86. Yelp Dataset Challenge
Academic dataset from Phoenix, Las Vegas, Madison, Waterloo, and Edinburgh!
● 1,125,458 reviews
● 42,153 businesses
○ 320,002 business attributes
● 403,210 tips
● 252,898 users
○ 955,999-edge social graph
● 31,617 check-in sets
+
Your academic project, research, and/or visualizations submitted by December 31, 2014
=
$5,000 prize + $1,000 for publication + $500 for presenting*
yelp.com/dataset_challenge
*See full terms on website
87. Thanks for listening!
Don’t be a stranger! ztm@yelp.com
Python MapReduce package: http://mrjob.org
Python Storm package: http://pyleus.org