Gabriel PREDA
@eRadical
(Almost) Serverless Analytics System
with BigQuery & AppEngine
Agenda
Going Serverless with
AppEngine & Tasks
Pub/Sub, Datastore
BigQuery
Load
Batch
Streaming Inserts
Query
UDF
Export
...some BigQueries...
Aeons...
Some years ago...
~ 500,000 - 2,000,000 events / day
(on average)
Some time ago...
~2,000,000 - 22,000,000 events / day
Dec 2014: 57,430,000 events / day
Recompute 1 day » 12 hours
NOW()
22,000,000 - 70,000,000 events / day
AVG » 40,000,000 events / day
Processing ~30GB-70GB / day
Recompute 1 day » 10-20 minutes
serverless?
Desired for: https://www.innertrends.com
other... (almost) serverless products
Cloud Functions (alpha - Node.js)
Cloud Dataflow (Java, Python - beta)
BigQuery
https://cloud.google.com/bigquery/docs/
BigQuery - data types
● STRING - UTF-8 (2 bytes + encoded string size)
● BYTES - base64 encoded (except in Avro)
● INTEGER - 64-bit signed (8 bytes)
● FLOAT (8 bytes)
● BOOLEAN - true/false, 1/0 only in CSV (1 byte)
● TIMESTAMP - e.g. "2014-08-19 12:41:35.220 UTC" (8 bytes)
● DATE, TIME, DATETIME - limited support in Legacy SQL
● RECORD - a collection of fields (size of fields)
https://cloud.google.com/bigquery/data-types
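A minimal sketch of creating a table using some of these types via the bq CLI (dataset, table, and column names are hypothetical; RECORD fields need a JSON schema file rather than the inline form):
# Hypothetical table with STRING, INTEGER, TIMESTAMP and BOOLEAN fields:
bq mk -t analytics.events user_id:integer,name:string,ts:timestamp,active:boolean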
BigQuery -> loadData()
Formats: CSV, JSON (newline delimited), Avro, Parquet (experimental)
Tools: Web UI, bq, API
Source:
local files,
Cloud Storage, [demo]
Cloud Datastore (backup files),
POST requests,
SQL DML*
Google Sheets
- Federated Data Sources
- Streaming Inserts (see the sketch below)
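Streaming inserts go through the tabledata.insertAll API; a minimal sketch via the bq CLI, assuming a hypothetical analytics.events table and a local file of rows:
# Stream rows into a table; rows.json holds one JSON object per line
# (dataset, table and file names are hypothetical):
bq insert analytics.events rows.json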
BigQuery -> loadData()
bq load ...
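The command on the slide is elided; a fuller sketch, with hypothetical dataset, bucket, and schema names:
# Load newline-delimited JSON from Cloud Storage into a daily table,
# with an inline schema (all names hypothetical):
bq load --source_format=NEWLINE_DELIMITED_JSON \
    analytics.events_20170101 \
    gs://my-bucket/events/2017-01-01/*.json \
    user_id:integer,ts:timestamp,payload:string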
BigQuery -> loadData()
Got some rows?
BigQuery -> SELECT … FROM surprise…
query:
SELECT { * | field_path.* | expression } [ [ AS ] alias ] [ , ... ]
[ FROM from_body
[ WHERE bool_expression ]
[ OMIT RECORD IF bool_expression]
[ GROUP [ EACH ] BY [ ROLLUP ] { field_name_or_alias } [ , ... ] ]
[ HAVING bool_expression ]
[ ORDER BY field_name_or_alias [ { DESC | ASC } ] [, ... ] ]
[ LIMIT n ]
];
from_body:
from_item [, ...] | # Warning: Comma means UNION ALL here (see the sketch after this grammar)
from_item [ join_type ] JOIN [ EACH ] from_item [ ON join_predicate ] |
(FLATTEN({ table_name | (query) }, field_name_or_alias)) |
table_wildcard_function
from_item:
{ table_name | (query) } [ [ AS ] alias ]
join_type:
{ INNER | [ FULL ] [ OUTER ] | RIGHT [ OUTER ] | LEFT [ OUTER ] | CROSS }
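As the grammar's warning notes, a comma in a legacy SQL FROM clause is a UNION ALL, not a join; a minimal sketch, assuming two hypothetical daily tables:
-- Legacy SQL: the comma concatenates the rows of both tables (UNION ALL),
-- so this counts the events of both days together.
SELECT COUNT(1) AS events
FROM [analytics.events_20170101], [analytics.events_20170102]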
BigQuery -> SELECT … FROM surprise…
Date-Partitioned Tables [demo]
Table Decorators - See the past w/ @
Table Wildcard Functions - TABLE_DATE_RANGE() & TABLE_QUERY() (sketch after this list)
Interesting functions
- DateTime » UTC_USEC_TO_DAY/HOUR/MONTH/WEEK/YEAR()
» Shifts a UNIX timestamp in microseconds to the beginning of the period it occurs in.
- JSON_EXTRACT[_SCALAR]()
- URL functions » HOST(), DOMAIN(), TLD()
- REGEXP_MATCH(), REGEXP_EXTRACT()
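A minimal sketch combining TABLE_DATE_RANGE() with UTC_USEC_TO_DAY(), assuming hypothetical daily tables analytics.events_YYYYMMDD with a TIMESTAMP column ts:
-- Legacy SQL: daily event counts over the last 7 days of daily tables.
SELECT USEC_TO_TIMESTAMP(UTC_USEC_TO_DAY(TIMESTAMP_TO_USEC(ts))) AS day,
       COUNT(1) AS events
FROM TABLE_DATE_RANGE([analytics.events_],
                      DATE_ADD(CURRENT_TIMESTAMP(), -7, 'DAY'),
                      CURRENT_TIMESTAMP())
GROUP BY day
ORDER BY day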
BigQuery -> User Defined Functions
bigquery.defineFunction(
  'expandAssetLibrary',                       // Name of the function exported to SQL
  ['user_id', 'video_id', 'stage_settings'],  // Names of input columns
  [ {'name': 'user_id', 'type': 'integer'},   // Output schema
    {'name': 'video_id', 'type': 'string'},
    {'name': 'asset', 'type': 'string'} ],
  expandAssetLibrary                          // Reference to JavaScript UDF
);

function expandAssetLibrary(row, emit) {
  // Body elided on the original slide; presumably each stage setting `ss`
  // in the parsed stage_settings yields one output row:
  JSON.parse(row.stage_settings).forEach(function(ss) {
    emit({user_id: row.user_id, video_id: row.video_id, asset: ss.url.replace('http://', '')});
  });
}
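The registered UDF is then invoked in legacy SQL as a table function over a subquery; a sketch, assuming a hypothetical source table analytics.videos:
SELECT user_id, video_id, asset
FROM expandAssetLibrary(
  SELECT user_id, video_id, stage_settings FROM [analytics.videos])
LIMIT 10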
BigQuery -> DML
Standard SQL only
Maximum UPDATE/DELETE statements per day per table: 48
Maximum UPDATE/DELETE statements per day per project: 500
Maximum INSERT statements per day per table: 1,000
Maximum INSERT statements per day per project: 10,000
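A minimal DML sketch in standard SQL (project, table, and column names are hypothetical); each such statement counts against the quotas above:
#standardSQL
UPDATE `my-project.analytics.events`
SET payload = NULL
WHERE user_id = 42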
BigQuery -> export()
To: Google Cloud Storage
Format: CSV, JSON [.gz], Avro
…split into files of up to 1 GB each (use a * wildcard URI for larger exports)
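A minimal export sketch with the bq CLI (table, bucket, and paths are hypothetical):
# Export a table to sharded, gzipped CSV files on Cloud Storage;
# the * wildcard lets BigQuery split exports larger than 1 GB:
bq extract --destination_format=CSV --compression=GZIP \
    analytics.events_20170101 \
    gs://my-bucket/export/events-*.csv.gz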
BigQuery -> some (Big)Queries
SELECT year, count(1)
FROM [bigquery-public-data:samples.natality]
WHERE father_age < 18
GROUP BY year
ORDER BY year
SELECT year, count(1)
FROM [bigquery-public-data:samples.natality]
WHERE mother_age < 18
GROUP BY year
ORDER BY year
SELECT table_id, row_count, CEIL(size_bytes/POW(1024, 3)) AS gb
FROM [bigquery-public-data:ghcn_m.__TABLES__] ORDER BY gb DESC
BigQuery -> some (Big)Queries
SELECT REGEXP_EXTRACT(path, r'.*\.(.*)$') AS file_extension,
COUNT(1) AS k
FROM [bigquery-public-data:github_repos.files]
GROUP BY file_extension
ORDER BY k DESC
LIMIT 20
SELECT table_id, row_count,
CEIL(size_bytes/POW(1024, 3)) AS gb
FROM [bigquery-public-data:github_repos.__TABLES__]
ORDER BY gb DESC
