3. Agenda
Big Data in Google and Google BigQuery
Why is BigQuery so fast?
Real-time Streaming Import by Fluentd + BigQuery
Real-time KPI analytics by Lambda Architecture
11. Applications
Gaming, Social, Mobile
Ads, Digital Marketing, DMP
Media
Monitoring, Alerting and Security
Retail
Internet of Things (IoT)
12. BigQuery Analytic Service in the Cloud
Import, Analyze and Export
Import: Adwords, DoubleClick, Google Analytics, Event Logs, Databases, IoT Devices
Analyze: BigQuery
Export: BI Tools, R and Pandas, Microsoft Excel, Google Spreadsheet, Hadoop/Hive, Spark
16. Massively Parallel Processing
select top(title), count(*)
from publicdata:samples.wikipedia
Scanning 1 TB in 1 sec takes 5,000 disks
Each query runs on thousands of servers
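The disk count above follows from simple throughput arithmetic. A rough sketch, assuming decimal units (1 TB = 10^12 bytes) and a scan split evenly across disks; the function name is hypothetical and only illustrates the calculation:

```python
def per_disk_throughput_mb_s(total_bytes, seconds, disks):
    """Read rate each disk must sustain (in MB/s) to finish the scan in time."""
    return total_bytes / seconds / disks / 1e6


# 1 TB scanned in 1 second, spread evenly over 5,000 disks.
rate = per_disk_throughput_mb_s(total_bytes=1e12, seconds=1.0, disks=5000)
print(f"{rate:.0f} MB/s per disk")
```

Under these assumptions each disk has to deliver roughly 200 MB/s, which is why a single query is spread across so many machines.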
17. Fast aggregation by tree structure
Mixer 0: ORDER BY count_babies DESC LIMIT 10
Mixer 1 (two nodes): COUNT(*) GROUP BY state
Shard (four nodes): COUNT(*) GROUP BY state WHERE year >= 1980 and year < 1990
ColumnIO on Colossus: SELECT state, year
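The tree-structured aggregation on this slide can be imitated in a few lines. This is a toy sketch, not BigQuery's implementation: the function names and sample rows are made up, and each "shard" is just an in-memory list of (state, year) pairs.

```python
from collections import Counter


def shard_count(rows):
    """Leaf shard: apply the WHERE filter and count rows per state."""
    return Counter(state for state, year in rows if 1980 <= year < 1990)


def mixer_merge(partials):
    """Intermediate mixer: sum the partial per-state counts from its children."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def root_mixer(partials, limit=10):
    """Root mixer: merge, order by count descending, and keep the top rows."""
    return mixer_merge(partials).most_common(limit)


# Four hypothetical shards feeding two mixers, which feed one root.
shards = [
    [("CA", 1981), ("TX", 1985), ("CA", 1979)],
    [("CA", 1989), ("NY", 1990)],
    [("TX", 1980), ("CA", 1982)],
    [("NY", 1983), ("CA", 1984), ("TX", 1987)],
]
leaf_results = [shard_count(rows) for rows in shards]
mixer_1a = mixer_merge(leaf_results[:2])
mixer_1b = mixer_merge(leaf_results[2:])
top_states = root_mixer([mixer_1a, mixer_1b])
```

Because counts are additive, each level only forwards a small per-state summary upward, so the root never sees the raw rows.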
18. Inside BQ: Big JOIN
Big JOIN: executed with shuffling
- Both tables can be > 8MB
- BQ shuffler does not sort; it only hash-partitions rows by join key
From: Google BigQuery Analytics
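A shuffle-based join of this kind can be sketched as follows. This is an illustrative toy under stated assumptions, not BigQuery's shuffler: rows are tuples whose first element is the join key, partitions are plain Python lists, and `zlib.crc32` stands in for whatever hash function the real system uses.

```python
import zlib
from collections import defaultdict


def partition(rows, num_partitions):
    """Route each row to a partition by hashing its join key (no sorting)."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        h = zlib.crc32(str(row[0]).encode("utf-8"))
        parts[h % num_partitions].append(row)
    return parts


def shuffle_join(left, right, num_partitions=4):
    """Inner join: matching keys always land in the same partition pair."""
    joined = []
    left_parts = partition(left, num_partitions)
    right_parts = partition(right, num_partitions)
    for lp, rp in zip(left_parts, right_parts):
        # Each partition pair is joined independently with a local hash table.
        index = defaultdict(list)
        for row in lp:
            index[row[0]].append(row)
        for row in rp:
            for match in index.get(row[0], []):
                joined.append(match + row[1:])
    return joined


left = [(1, "a"), (2, "b"), (3, "c")]
right = [(1, "x"), (3, "y"), (3, "z"), (4, "w")]
result = sorted(shuffle_join(left, right))
```

Since both sides use the same hash function, each partition pair can be processed by a separate worker without seeing the rest of either table.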
20. “I want a real-time dashboard
for collecting the votes and
system stats from 200 servers”
21. BigQuery Streaming
Low cost: $0.01 per 100,000 rows
Real-time availability of data
100,000 rows per second per table, multiplied by the number of tables
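To get a feel for the quoted price, here is a small cost estimate. It uses only the figures on this slide ($0.01 per 100,000 rows, 100,000 rows per second per table), which may no longer match current pricing; the function names are hypothetical.

```python
PRICE_PER_100K_ROWS_USD = 0.01  # figure quoted on the slide; may be outdated
SECONDS_PER_DAY = 86_400


def streaming_cost_usd(rows):
    """Cost of streaming the given number of rows at the quoted price."""
    return rows / 100_000 * PRICE_PER_100K_ROWS_USD


def daily_cost_usd(rows_per_second):
    """Cost of sustaining a constant ingest rate for 24 hours."""
    return streaming_cost_usd(rows_per_second * SECONDS_PER_DAY)


# One table streaming at the quoted per-table rate for a full day.
one_table_day = daily_cost_usd(100_000)
print(f"${one_table_day:,.2f} per day")
```

At the quoted rate, a single table running flat out for a day ingests 8.64 billion rows, which works out to about $864.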
22. Slideshare uses Fluentd for collecting logs from >500 servers.
"We take full advantage of its extendable plugin architecture and use it as a message bus that collects data from hundreds of servers into multiple backend systems." (Sylvain Kalache, Operations Engineer)
23. Why Fluentd? Because it’s super easy to use,
and has an extensive set of plugins written by an active community.
24. Now Fluentd logs can be imported into
BigQuery really easily, at ~1M rows/s
28. Lambda Architecture is:
A complementary pair of:
- in-memory real-time processing: fast, but small and volatile
- large HDD/SSD batch processing: slow, but large and persistent
Proposed by Nathan Marz (ex-Twitter)
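The pairing described on this slide can be illustrated with a toy counter. This is a minimal sketch with hypothetical names, not code from any specific Lambda Architecture framework: the "batch layer" is an append-only list, the "speed layer" is an in-memory counter, and queries combine the two.

```python
from collections import Counter


class LambdaCounter:
    """Toy event counter combining a batch view and a real-time view."""

    def __init__(self):
        self.master_log = []         # batch layer: large, persistent, append-only
        self.batch_view = Counter()  # slow to rebuild, covers the full history
        self.speed_view = Counter()  # fast, covers only events since the last batch run

    def ingest(self, key):
        """Every event goes to both layers."""
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Recompute the batch view from scratch, then discard the volatile view."""
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, key):
        """Serve results by merging the batch view with the real-time view."""
        return self.batch_view[key] + self.speed_view[key]


votes = LambdaCounter()
for key in ["vote_a", "vote_a", "vote_b"]:
    votes.ingest(key)
before_batch = votes.query("vote_a")   # answered from the speed layer only
votes.run_batch()
votes.ingest("vote_a")
after_batch = votes.query("vote_a")    # batch view plus one recent event
```

Losing the speed view costs only the most recent events, because the batch layer can always rebuild the full result from the persistent log.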
29. Norikra: an open source stream processing tool
In production use at LINE, the largest Asian SNS with 500M users, for massive log analysis
Super easy to use: requires no heavyweight cluster setup
32. Proposed Solution: Lambda Architecture
1. Fluentd: event log collection from various event sources
2. Norikra: easy, scalable real time stream processing
3. BigQuery: scalable query engine for large datasets
4. Google Spreadsheet: flexible dashboard with charts
5. Docker: repeatable deployment in 10 minutes
34. Applications
Real-time KPI Dashboard
● Gaming: How many new users have purchased their first item in the last 10 minutes?
● Media: How many people hit the vote button during the live TV program?
● Retail: What is the current total revenue of all stores nationwide?
● Ads: What is the conversion rate from impressions/clicks to purchases?
Real-time Monitoring and Alerting
● Correlate system resource usage with access/application logs
● Real-time DoS or cheating detection
● Send e-mail notifications from Apps Script triggered by Norikra
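Questions such as "how many in the last 10 minutes?" are sliding-window counts over an event stream. The sketch below shows the idea in plain Python; it is not Norikra's API (Norikra expresses this kind of query in SQL), the class name is hypothetical, and it assumes timestamps in seconds that arrive in increasing order.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events whose timestamp falls within the last `window_seconds`."""

    def __init__(self, window_seconds=600):
        self.window = window_seconds
        self.events = deque()  # timestamps, oldest first

    def add(self, timestamp):
        """Record one event, e.g. a first purchase by a new user."""
        self.events.append(timestamp)

    def count(self, now):
        """Evict events that have aged out of the window, then count the rest."""
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events)


purchases = SlidingWindowCounter(window_seconds=600)
for ts in (0, 100, 650):
    purchases.add(ts)
count_at_650 = purchases.count(now=650)
count_at_700 = purchases.count(now=700)
```

Only the events inside the window are kept in memory, which is what makes this kind of query cheap to answer continuously.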
35. Solution Benefits
Easy real-time SQL-based KPI analytics at 1M+ rows/sec by Norikra
Easy real-time streaming import at 1M+ rows/sec by BigQuery + Fluentd
Real-time dashboard with Google Spreadsheet
Deployable within 10 min with Docker
Search “lambda dashboard” on GitHub