Performing ETL on big data can be slow, expensive and painful - but it doesn't have to be! In this session, we'll take an in-depth look at several real-world examples of computations that don't fit well with the SQL language model and how to solve them with user-defined functions in Google BigQuery.
Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/2014/conference/hands-on-with-bigquery-javascript-user-defined-functions
BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at Big Data Spain 2014
1. HANDS ON WITH BIGQUERY JAVASCRIPT UDFS
THOMAS PARK
SOFTWARE ENGINEER - GOOGLE
2. Hands-on with BigQuery JavaScript
User-Defined Functions
Thomas Park
Software Engineer - Google
Felipe Hoffa
@felipehoffa
Developer Advocate - Google
6. BigQuery: Big Data Analytics in the Cloud
Unrivaled
Performance and Scale
● Scan multiple TB’s in seconds
● Interactive query
performance
● No limits on amount of data
Ease of Use
and Adoption
● No administration /
provisioning
● Convenience of SQL
● Open interfaces
(REST, WebUI, ODBC)
● First 1 TB of data processed
per month is free
Advanced “Big Data”
Storage
● Familiar database structure
● Easy data management and
ACL’s
● Fast, atomic imports
7. Google confidential │ Do not distribute
How many pageviews does Wikipedia
have in a month?
SELECT COUNT(*)FROM
[fh-bigquery:wikipedia.wikipedia_views_201308]
https://bigquery.cloud.google.com/table/fh-bigquery:wikipedia.pagecounts_20140602_18
8. Google confidential │ Do not distribute
$500 in Cloud Platform credit
to launch your idea!
Build. Store. Analyze.
On the same infrastructure
that powers Google
Start building
Click ‘Apply Now’ and complete the
application with promo code: bigdata-spain
Starter Pack
Offer Description
1
2
3
Go to cloud.google.com/developers/starterpack
10. Images by Connie Zhou
Scenario:
Door access records from a
Very Well-Secured Lab
where users must badge in
to enter or leave
Image by Tod Kurt
11. Images by Connie Zhou
Example:
Time-series analysis
from discrete user action data
Image by Tod Kurt
12. user_id timestamp
Beep!!
9h: arrive @ lab thomas 2014.07.15 09:00
Beep!!
10h: leave to pick up
prototype
thomas 2014.07.15 10:00
Beep!!
10h15: return with
prototype
thomas 2014.07.15 10:15
Beep!!
12h: out for lunch thomas 2014.07.15 12:00
13. How can we find out how much time
each user spent in the lab?
...where each scan of the user’s access card is
represented as a discrete row?
14. rownum user_id timestamp
1 thomas 2014.07.15 09:00
2 thomas 2014.07.15 10:00
3 thomas 2014.07.15 10:15
4 thomas 2014.07.15 12:00
60 minutes
105 minutes
Our analysis with data
in this format via SQL
is horrid and painful
A BigQuery + JS
friendly format:
data for each user in
separate rows
user_id timestamps
thomas [ 09:00, 10:00, 10:15, 12:00, ... ]
hoffa [ 08:10, 11:30, 12:00, 12:15, ... ]
SELECT user_id, NEST(timestamp) AS timestamps
FROM T
GROUP BY user_id;
Producing this format
is trivial in BigQuery...
15. // This function will be called once for each user,
// and receive an array of timestamps.
function(record, emit) {
var total_time = 0;
// Order of records built by NEST are not guaranteed!
// Sort to guarantee ascending timestamps.
var ts = record.ts.sort(
function (a, b) {return a > b;});
// Loop over timestamp pairs, calculate interval.
for (var i = 0; i < ts.length - 1; i += 2) {
total_time += (ts[i+1] - ts[i]);
}
// Emit total time for this user.
emit({user: record.user_id,
total_time: total_time});
}
JS: Total time for each
user
23. How BigQuery works
Get data from lower levels,
filter / join / transform,
send rows up
Tree Structured Query Dispatch and Aggregation
Distributed Storage
SELECT title, requests
Leaf Leaf Leaf Leaf
SUM(requests)
GROUP BY title
WHERE REGEX_MATCH(title, 'pat.*rn')
Mixer 1 Mixer 1 SUM(requests)
GROUP BY title
Mixer 0
LIMIT 10
ORDER BY c DESC
SUM(requests)
GROUP BY title
24. Data for each row is calculated and
streamed through a “Row Iterator”
Subquery0 Subquery1
JOIN
Row Iterator 0 Row Iterator 1
Row Iterator 2
25. Can insert JavaScript Functions
wherever we have a Row Iterator
Subquery0 Subquery1
JOIN
Row Iterator 0 Row Iterator 1
Row Iterator 2
UDF1
UDF0
26. Join order item info with web hits info
SELECT
item FROM
orders
SELECT query
string FROM hits
JOIN
Row Iterator 0 Row Iterator 1
Row Iterator 2
UDF1
UDF0