Adam also works through how digital filters can be used either on a real-time stream or in batch mode in Hadoop (with Scalding), and hopes to include some examples of how to detect trendy, meme-ish blogs on Tumblr.

Bio: Adam Laiacano is a Data Scientist and Engineer at Tumblr, a blogging network with over 140 million blogs, where he's responsible for collecting and analyzing large volumes of data to gain a better understanding of trends and activity within the Tumblr community. He holds a Bachelor of Science degree in Electrical Engineering from Northeastern University, and designed signal detection systems for low-power atomic clocks before joining Tumblr.


- 1. Digital Signal Processing in Hadoop with Scalding. Adam Laiacano, adam@tumblr.com, @adamlaiacano. Thursday, October 17, 13
- 2. Overview • Intro to digital signals and filters (sampling, frequency domain, FIR / IIR filters) • Very quick intro to Scalding • Filtering tons of signals at once • Application: finding trending blogs on Tumblr
- 3. [Figure: example signal, 1 sample / day]
- 4. [Figure: 7-day average, 1 sample / day]
- 5. Some Definitions • Signal: any series of data (volts, posts, etc.) that is measured at regular intervals • Sampling period, Ts: time between samples (in my example, Ts = 1 day) • Sampling frequency, fs: 1 / Ts • Nyquist frequency: the highest frequency that can be represented, fs / 2 • Filter: a system that reduces or enhances certain aspects (phase, magnitude) of a signal • Stopband: the frequency range we want to eliminate • Passband: the frequency range we want to preserve • Cutoff frequency, fc: the boundary between the stopband and the passband
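
As a quick check on the definitions above, here is a minimal sketch that plugs in the Ts = 1 day example. The function names are illustrative, not from the talk, and the units (days, cycles per day) are a choice made for this example.

```python
def sampling_frequency(ts):
    """Sampling frequency fs = 1 / Ts."""
    return 1.0 / ts


def nyquist_frequency(ts):
    """Nyquist frequency = fs / 2, the highest frequency that can be represented."""
    return sampling_frequency(ts) / 2.0


ts_days = 1.0                        # Ts = 1 day, as in the example signal
fs = sampling_frequency(ts_days)     # in cycles per day
fn = nyquist_frequency(ts_days)      # in cycles per day
shortest_period_days = 1.0 / fn      # shortest cycle the samples can capture

print(fs, fn, shortest_period_days)
```

With one sample per day, the fastest pattern the data can represent repeats every two days; a weekly cycle sits comfortably below that limit.
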
- 6. Signals
- 7. Sampling [Figure: original analog signal, and the signal sampled at 10 samples/period]
- 8. Sampling [Figure: signal sampled at 1 sample/period and at 2 samples/period]
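
The three sampling rates on the slides above can be reproduced numerically. This is only a sketch: the sine wave, the phase offset, and the helper name `sample_sine` are assumptions for illustration, since the original plots are not available in this transcript.

```python
import math


def sample_sine(samples_per_period, n_samples, phase=0.0):
    """Sample sin(2*pi*t + phase), with t measured in periods, at a fixed rate."""
    return [math.sin(2 * math.pi * k / samples_per_period + phase)
            for k in range(n_samples)]


# Arbitrary offset so the samples do not all land on zero crossings.
phase = math.pi / 4

dense = sample_sine(10, 20, phase)    # 10 samples/period: traces out the waveform
nyquist = sample_sine(2, 20, phase)   # 2 samples/period: alternates between +a and -a
aliased = sample_sine(1, 20, phase)   # 1 sample/period: every sample is the same value

# At 1 sample/period the oscillation is invisible: the samples look like a constant.
spread = max(aliased) - min(aliased)
print(spread)
```
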
- 9. Filters
- 10. FILTER
- 11. Low-Pass Filter [Figure labels: Passband, Stopband, fc, fn]
- 12. Low-Pass Filter, closer to reality [Figure labels: Passband, Stopband, fc, fn]
- 13. Moving Average Filter: y[t] = 1/7 * x[t] + 1/7 * x[t-1] + 1/7 * x[t-2] + ... + 1/7 * x[t-6]
- 14. FIR Digital Filter: y[t] = h[0] * x[t] + h[1] * x[t-1] + h[2] * x[t-2] + ... + h[N-1] * x[t-(N-1)]. R code: y <- filter(x, h)
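
The FIR sum above translates almost line for line into code. The sketch below assumes a causal filter with samples before t = 0 treated as zero (the slides do not say how the edges are handled), and the input data is made up for the example.

```python
def fir_filter(x, h):
    """Compute y[t] = sum over k of h[k] * x[t-k], treating x[t] for t < 0 as zero."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, coeff in enumerate(h):
            if t - k >= 0:
                acc += coeff * x[t - k]
        y.append(acc)
    return y


# The 7-day moving average is the special case h = [1/7] * 7.
moving_average_7 = [1.0 / 7] * 7
daily_counts = [7.0] * 10             # hypothetical data: 7 posts every day
smoothed = fir_filter(daily_counts, moving_average_7)
print(smoothed)
```

The first six outputs ramp up because the window is still filling with the assumed zeros; from the seventh sample on, the output matches the constant input.
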
- 15. Frequency Domain: x = 1.0 * sin(0.5*2*pi*t) + 0.5 * sin(250*2*pi*t) + 0.1 * sin(400*2*pi*t)
- 16. Frequency Domain [Figure]
- 17. Frequency Domain: 21-point low-pass filter with 250Hz cutoff. h = [-0.0201, -0.0584, -0.0612, -0.0109, 0.0513, 0.0332, -0.0566, -0.0857, 0.0634, 0.3109, 0.4344, 0.3109, 0.0634, -0.0857, -0.0566, 0.0332, 0.0513, -0.0109, -0.0612, -0.0584, -0.0201] http://t-filter.appspot.com/fir/index.html
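
One way to inspect a set of FIR coefficients like the ones above is to evaluate the filter's gain at the frequencies present in the signal. The sketch below does that directly from the definition of the frequency response. Note that the slides do not state the sampling frequency the filter was designed for, so `fs = 2000.0` is purely an assumption here, and the printed gains are only meaningful if it matches the real design value.

```python
import cmath
import math


def magnitude_response(h, f, fs):
    """Return |H(f)| where H(f) = sum over k of h[k] * exp(-j * 2*pi * f * k / fs)."""
    omega = 2 * math.pi * f / fs
    return abs(sum(c * cmath.exp(-1j * omega * k) for k, c in enumerate(h)))


# Coefficients copied from the slide.
h = [-0.0201, -0.0584, -0.0612, -0.0109, 0.0513, 0.0332, -0.0566,
     -0.0857, 0.0634, 0.3109, 0.4344, 0.3109, 0.0634, -0.0857,
     -0.0566, 0.0332, 0.0513, -0.0109, -0.0612, -0.0584, -0.0201]

fs = 2000.0  # ASSUMPTION: not given in the talk

# Gain at each of the three sinusoids in the example signal.
gains = {f: magnitude_response(h, f, fs) for f in (0.5, 250.0, 400.0)}
for f, gain in gains.items():
    print(f, gain)
```

The same function works for any FIR filter; for instance, a 7-point moving average sampled once per day has zero gain at exactly one cycle per week.
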
- 18. FIR vs IIR. FIR: y[t] = h[0] * x[t] + ... + h[N-1] * x[t-(N-1)]. IIR: y[t] = h[0] * x[t] + ... + h[N-1] * x[t-(N-1)] + g[1] * y[t-1] + ... + g[M] * y[t-M]
- 19. Delta Function
- 20. FIR vs IIR. FIR: y[t] = 1/7 * x[t] + 1/7 * x[t-1] + ... + 1/7 * x[t-6]. IIR: y[t] = 1/2 * x[t] + 1/2 * y[t-1]
- 21. IIR - be careful! y[t] = 0.5 * x[t] + 1.1 * y[t-1]
- 22. Impulse Response
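
The "be careful" warning is easy to see by feeding a delta function into the two one-pole IIR filters from the slides. This is a minimal sketch; the function name and the 50-sample length are choices made for the example, and y[-1] is assumed to be zero.

```python
def one_pole_iir(x, b, a):
    """Compute y[t] = b * x[t] + a * y[t-1], starting from y[-1] = 0."""
    y = []
    prev = 0.0
    for sample in x:
        prev = b * sample + a * prev
        y.append(prev)
    return y


# Delta function: a single 1 followed by zeros.
delta = [1.0] + [0.0] * 49

stable = one_pole_iir(delta, 0.5, 0.5)     # feedback 0.5: response decays toward zero
unstable = one_pole_iir(delta, 0.5, 1.1)   # feedback 1.1: response keeps growing

print(stable[:3], stable[-1])
print(unstable[:3], unstable[-1])
```

Each output is the previous one multiplied by the feedback coefficient, so any coefficient with magnitude above 1 makes the impulse response grow without bound.
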
- 23. Recap - FIR Filters • FIR filters are weighted sums of the current and previous inputs • Can think of them as a generalized moving average • Required to apply: a filter h of length N, and the N most recent inputs x
- 24. [No text on this slide]
- 25. Super General Overview • DSL on top of Cascading, written in Scala • Cascading: workflow language for dealing with lots of data, often in Hadoop • Similar to Pig or Hive, but easier to extend (no UDFs! one language!) • Feels like “real programming” - compiler! types! • Is awesome
- 26. Less General Overview • Similar to the split/apply/combine paradigm (plyr, pandas) • Load data into Pipes (like data.frames) • Each Pipe has one or more Fields (columns) • Perform row-wise operations with map (d$a+d$b) • Perform field-wise operations in groupBy
- 27. Hello World
- 28. Scalding Resources • The best resource is the Scalding wiki page: https://github.com/twitter/scalding/wiki/Fields-based-API-Reference • Edwin Chen’s post about recommendations: http://blog.echen.me/2012/02/09/movie-recommendations-and-more-via-mapreduce-and-scalding/ • Source code is FULL of undocumented features! https://github.com/twitter/scalding
- 29. import Matrix._
- 30. [Figure: data vector with a sliding filter; each output uses a sliding subset of the input]
- 31. [Figure: data vector and sliding filter, with each shifted copy of the filter padded by zeros]
- 32. [Figure: (Data Vector)^T * Filter Matrix = (Filtered Output)^T; the filter matrix is mostly zeros]
- 33. [Figure: (Data Vector)^T * Filter Matrix = (Filtered Output)^T; the filter matrix is mostly zeros]
- 34. X^T * H = Y^T, where X^T is N x M, H is M x M, and Y^T is N x M • N = number of blogs • M = number of samples
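
The matrix form of filtering can be sketched with plain lists. The layout below is an assumption, not taken from the talk's code: one row per blog, a causal filter with zeros before the first sample, and H built so that entry (i, j) holds h[j - i]. The two-blog data and the 2-point filter are made up for the example.

```python
def filter_matrix(h, m):
    """Build the M x M banded matrix H with H[i][j] = h[j - i] (zero outside the band),
    so that (x^T * H)[j] = sum over k of h[k] * x[j - k]."""
    return [[h[j - i] if 0 <= j - i < len(h) else 0.0 for j in range(m)]
            for i in range(m)]


def matmul(a, b):
    """Plain dense matrix product of two matrices given as lists of rows."""
    return [[sum(a_row[k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for a_row in a]


h = [0.5, 0.5]                      # hypothetical 2-point moving average
x_t = [[2.0, 4.0, 6.0, 8.0],        # blog 1: N = 2 rows (blogs), M = 4 samples
       [1.0, 1.0, 1.0, 1.0]]        # blog 2

H = filter_matrix(h, 4)
y_t = matmul(x_t, H)                # every row (blog) is filtered in one product
print(y_t)
```

Because H is zero outside a narrow band, it is a natural fit for a sparse representation; the dense lists here are only for readability.
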
- 35. Matrix Filter: Square Waves
- 36. Matrix Filter: Square Waves
- 37. Matrix Filter: Square Waves
- 38. import Matrix._ • Scalding has a Matrix library! • Stores data in a Pipe as ('row, 'col, 'val) • Ideal for sparse matrices • L0, L1, L2 norm, inverse, +, -, * • QR Factorization: http://bit.ly/1hxWF17 • More!
- 39. Tumblr Social Graph • 140+ million nodes • 3.5 billion edges • About 100GB of raw text data • 3 columns: fromId, toId, timestamp • GOAL: calculate followers / day for every blog, then apply a 1-week moving average.
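
The first half of the goal, followers per day for every blog, is a group-and-count over the edge list. The sketch below is an in-memory stand-in for what would be a groupBy in Scalding. It assumes that `timestamp` is in Unix epoch seconds and that `toId` is the blog being followed; neither is stated on the slide, and the sample edges are invented.

```python
from collections import Counter

SECONDS_PER_DAY = 86400


def followers_per_day(edges):
    """Count new followers per (toId, day index) from (fromId, toId, timestamp) rows."""
    counts = Counter()
    for _from_id, to_id, timestamp in edges:
        day = timestamp // SECONDS_PER_DAY   # ASSUMPTION: timestamp is epoch seconds
        counts[(to_id, day)] += 1
    return counts


# Hypothetical edges: (fromId, toId, timestamp)
edges = [
    (1, 100, 0),        # day 0
    (2, 100, 3600),     # day 0
    (3, 100, 86400),    # day 1
    (1, 200, 90000),    # day 1
]

daily = followers_per_day(edges)
print(daily)
```

The per-blog daily series that comes out of this step is the signal that the 1-week moving average is then applied to.
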
- 40. Apply Low-Pass filter to 140,000,000 blogs
- 41. Find blogs that have accelerating follower counts for the most consecutive days. “Accelerating”: New Followers Today > New Followers Yesterday
- 42. [Figure: Blog A, Blog B, Blog C]
- 43. [Figure: Blog A, Blog B, Blog C]
- 44. [Figure: Blog A, Blog B, Blog C]
- 45. Days of consecutive acceleration • Binary input: 1 if more new followers today than yesterday, otherwise 0 • Filter the binary signal to produce a value between 0 and 1 • Anything above a threshold (0.75) is “accelerating”
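
The three steps above can be sketched end to end. The slide does not say which filter is applied to the binary signal, so the 4-point moving average here is an assumption, as are the function names and the sample follower counts; only the 0.75 threshold and the today-versus-yesterday rule come from the talk.

```python
def acceleration_flags(new_followers):
    """1 if today's new followers exceed yesterday's, otherwise 0.
    The first day has no 'yesterday', so the result is one element shorter."""
    return [1 if today > yesterday else 0
            for yesterday, today in zip(new_followers, new_followers[1:])]


def moving_average(x, n):
    """Causal n-point moving average; samples before the start count as zero."""
    return [sum(x[max(0, t - n + 1): t + 1]) / n for t in range(len(x))]


def is_accelerating(new_followers, n=4, threshold=0.75):
    """Filter the binary signal and flag days whose filtered value exceeds the threshold."""
    smoothed = moving_average(acceleration_flags(new_followers), n)
    return [value > threshold for value in smoothed]


counts = [1, 2, 4, 7, 11, 16, 15, 14]    # hypothetical new followers per day
flags = acceleration_flags(counts)
smoothed = moving_average(flags, 4)
result = is_accelerating(counts)
print(flags, smoothed, result)
```

Filtering the flags rather than counting exact runs makes the measure tolerant of an occasional off day, depending on the filter length and threshold chosen.
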
- 46. Days of consecutive acceleration (filtered signal) [Figure labels: 15 days, 36 days, 48 days]
- 47. Days of consecutive acceleration (filtered signal) [Figure labels: 15 days, 36 days, 48 days]
- 48. Thanks! @adamlaiacano • adamlaiacano.tumblr.com • github.com/alaiacano/dsp-scalding
