Keynote: Big Data Analytics Ecosystem Strategy (IBM Sverige)
This document discusses IBM's analytics portfolio and vision. It provides an overview of Big Data trends, how Watson has evolved to be faster and smaller, and the need for real-time analytics. It also discusses IBM's approach to Big Data challenges like volume, velocity, variety and veracity. The document outlines IBM's analytics platform capabilities including accelerators, information integration, governance, and Hadoop solutions. It highlights the evolution of IBM Netezza and DB2 in analytics and how IBM is committed to helping clients succeed with Big Data.
IBM Cloudant is a NoSQL Database-as-a-Service. Discover how you can outsource the data layer of your mobile or web application to Cloudant to gain high availability, scalability, and tools to take you to the next level.
Intro to new Google cloud technologies: Google Storage, Prediction API, BigQuery (Chris Schalk)
This document introduces several new Google cloud technologies: Google Storage for storing data in Google's cloud, the Prediction API for machine learning and predictive analytics, and BigQuery for interactive analysis of large datasets. It provides overviews and examples of using each service, highlighting their capabilities for scalable data storage, predictive modeling, and fast querying of massive amounts of data.
Building a Real-Time Gaming Analytics Service with Apache Druid (Imply)
At GameAnalytics we receive and process real-time behavioural data from more than 100 million daily active users, helping thousands of game studios and developers understand user behaviour and improve their games. In this talk, you will learn how we migrated our legacy backend from an in-house streaming analytics service to Apache Druid, and the lessons learned along the way. By adopting Druid, we have been able to reduce development costs, increase the reliability of our systems, and implement new features that would not have been possible with our old stack. We will provide an overview of our approach to schema design, segment optimization, the creation of our query layer, caching, and datasource optimization, which can help you understand how to use Druid successfully as a key component of your data processing and reporting infrastructure.
Analyze and visualize non-relational data with DocumentDB + Power BI (Sriram Hariharan)
This session will show how to analyze and visualize non-relational data with DocumentDB + Power BI. We are in the midst of a paradigm shift in how we store and analyze data. Unstructured or flexible-schema data represents a large portion of the data within an organization, and everyone wants to turn it into meaningful business information. Unstructured data analytics does not need to be time-consuming and complex. Come learn how to analyze and visualize unstructured data in DocumentDB.
The document discusses building a CouchDB application to store human protein data, describing how each protein document would contain information like name, sequence, and other defining features extracted from public databases. It provides an example protein document to demonstrate the type of data that would be stored.
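The example document itself isn't reproduced in this summary; as a minimal sketch, assuming CouchDB's standard HTTP API and invented field values, such a protein record could be stored like this:

```python
import requests  # CouchDB speaks plain HTTP + JSON

COUCH = "http://localhost:5984"  # assumed local CouchDB instance

# Hypothetical protein document; field names mirror the abstract
# (name, sequence, and other defining features from public databases).
doc = {
    "_id": "P69905",  # e.g. a UniProt accession used as the document ID
    "name": "Hemoglobin subunit alpha",
    "sequence": "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSF...",  # truncated
    "organism": "Homo sapiens",
    "source_dbs": ["UniProt", "PDB"],
}

requests.put(f"{COUCH}/proteins")  # create the database (error check omitted)
resp = requests.put(f"{COUCH}/proteins/{doc['_id']}", json=doc)
print(resp.json())  # e.g. {'ok': True, 'id': 'P69905', 'rev': '1-...'}
```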
As Twitch grew, both the amount of data we received and the number of employees interested in that data grew rapidly. In order to continue empowering decision making as we scaled, we turned to Druid and Imply to provide self-service analytics to both our technical and non-technical staff, allowing them to drill into high-level metrics instead of reading generated reports.
In this talk, learn how Twitch implemented a common analytics platform for the needs of many different teams supporting hundreds of users, thousands of queries, and ~5 billion events each day. This session will explain our Druid architecture in detail, including:
-The end-to-end architecture deployed on Amazon that includes Kinesis, RDS, S3, Druid, Pivot and Tableau
-How the data is brought together to deliver a unified view of live customer engagement and historical trends
-Operational best practices we learnt scaling Druid
-An example walk through using the platform
Gregorry Letribot - Druid at Criteo - NoSQL matters 2015 (NoSQLmatters)
How do you monitor performance for one of your clients on a specific user segmentation when dealing with billions of events a day? With over 2 billion ads served and 230 TB of data processed a day, we at Criteo have a comprehensive need for an interactive analytics stack. And by interactive, we mean a querying system with dynamic filtering to drill down over multiple dimensions, answering within sub-second latency. This session will take you on our journey with Druid, "an open-source data store designed for real-time exploratory analytics on large data sets". We will explore Druid's architecture and notable concepts, how relevant they are for some use cases, and how it really performs.
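The 2015 talk predates Druid's SQL layer, but the kind of drill-down it describes is easy to sketch against a modern broker's SQL endpoint (datasource, columns, and broker address below are assumptions):

```python
import requests

DRUID_SQL = "http://druid-broker:8082/druid/v2/sql"  # assumed broker address

# Hypothetical drill-down: impressions for one client, split by country,
# over the last day -- the "dynamic filtering over multiple dimensions"
# the talk describes.
query = """
SELECT country, COUNT(*) AS impressions
FROM ad_events
WHERE client_id = 'acme' AND __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY country
ORDER BY impressions DESC
LIMIT 10
"""

rows = requests.post(DRUID_SQL, json={"query": query}).json()
for row in rows:
    print(row["country"], row["impressions"])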
In this presentation we go through the differences and similarities between Redshift and BigQuery. It was presented at the Athens Big Data meetup in May 2017.
A short introduction to BigQuery. With this presentation you'll quickly discover:
How to load data into BigQuery (see the sketch below)
How to build dashboards using BigQuery
How to work with BigQuery
and, last but not least, some best practices
We hope you'll enjoy this presentation and that it will help you start exploring this wonderful solution. Don't hesitate to send us your feedback or questions.
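As a minimal sketch of the load and query bullets above, using the google-cloud-bigquery Python client (project, dataset, table, and bucket names are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()  # credentials via GOOGLE_APPLICATION_CREDENTIALS

# Load a CSV from Cloud Storage into a table (placeholder names).
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # let BigQuery infer the schema
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/events.csv", "my_project.my_dataset.events",
    job_config=job_config,
)
load_job.result()  # wait for the load to complete

# Query it -- iterating the query job yields result rows.
for row in client.query("SELECT COUNT(*) AS n FROM my_dataset.events"):
    print(row.n)
```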
Google BigQuery for Everyday Developer (Márton Kodok)
IV. IT&C Innovation Conference - October 2016 - Sovata, Romania
A. Every scientist who needs big data analytics to save millions of lives should have that power
Legacy systems don’t provide the power.
B. The simple fact is that you are brilliant but your brilliant ideas require complex analytics.
Traditional solutions are not applicable.
The Plan: have oversight over developments as they happen.
Goal: Store everything accessible by SQL immediately.
What is BigQuery?
Analytics-as-a-Service - Data Warehouse in the Cloud
Fully-Managed by Google (US or EU zone)
Scales into Petabytes
Ridiculously fast
Decent pricing (queries: $5/TB, storage: $20/TB) *October 2016 pricing
100,000 rows/sec Streaming API (sketched below)
Open Interfaces (Web UI, BQ command line tool, REST, ODBC)
Familiar DB Structure (table, views, record, nested, JSON)
Convenience of SQL + Javascript UDF (User Defined Functions)
Integrates with Google Sheets + Google Cloud Storage + Pub/Sub connectors
Client libraries available in YFL (your favorite languages)
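For example, the streaming API from the list above can be exercised with today's Python client (table name is a placeholder; the 2016-era raw REST call differed in shape, not in idea):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Stream rows into a table; they become queryable within seconds.
rows = [
    {"user_id": "u1", "event": "click", "ts": "2016-10-01T12:00:00"},
    {"user_id": "u2", "event": "view",  "ts": "2016-10-01T12:00:01"},
]
errors = client.insert_rows_json("my_project.my_dataset.events", rows)
assert errors == [], errors  # an empty list means every row was accepted
```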
Our benefits
no provisioning/deployment
no running out of resources
no more focus on large-scale execution plans
no need to re-implement tricky concepts (time windows / join streams)
pay only for the columns referenced in your queries (illustrated below)
run raw ad-hoc queries (whether from analysts, sales, or devs)
no more throwing away, expiring, or aggregating old data
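The pay-per-column point is easy to verify: a dry-run query reports the bytes that would be scanned, which is what BigQuery bills on (table name hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()
config = bigquery.QueryJobConfig(dry_run=True)  # estimate, don't execute

job = client.query("SELECT user_id FROM my_dataset.events", job_config=config)
# Only the user_id column is scanned, so the estimate -- and the bill --
# shrinks even if the table has dozens of other columns.
print(f"{job.total_bytes_processed / 1e9:.2f} GB would be processed")
```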
BigQuery is Google's columnar, massively parallel data querying solution. This talk explores using it as an ad-hoc reporting solution and the limitations present in May 2013.
Data persistence using PouchDB and CouchDB (Dimgba Kalu)
A presentation on data persistence using PouchDB and CouchDB. This is a basic way of building an offline data repository with efficient data synchronization.
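PouchDB itself is JavaScript; to keep these sketches in one language, here is the server side of the same idea, CouchDB's replication endpoint, which PouchDB's sync() speaks against (database names invented):

```python
import requests

COUCH = "http://localhost:5984"  # assumed CouchDB instance

# One-shot replication between two databases -- the same protocol
# PouchDB uses to reconcile an offline browser copy with the server.
resp = requests.post(
    f"{COUCH}/_replicate",
    json={
        "source": f"{COUCH}/app_local",
        "target": f"{COUCH}/app_central",
        "create_target": True,
    },
)
print(resp.json())  # includes doc read/write counts and any conflicts
```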
How TrafficGuard uses Druid to Fight Ad Fraud and Bots (Imply)
In this session, TrafficGuard’s Head of Data Science, Raigon Jolly, will discuss how TrafficGuard uses Druid and its partnership with Imply to:
- Provide granular reporting to clients in near-real time
- Monitor rules and concept drift
- Stay ahead of the moving target that is ad fraud
- Facilitate performance tuning and right-sizing infrastructure so our team can focus on innovation of our core product
Gian will offer his reflections on the Druid journey to date, plus describe his vision for what Druid will become. He will lay out the near-term Druid roadmap and take your questions.
Watch video: https://imply.io/virtual-druid-summit/apache-druid-vision-and-roadmap-gian-merlino
MoPub, a Twitter company, provides monetization solutions for mobile app publishers and developers around the globe. MoPub receives over 33 billion ad requests per day, generating over 200 TB of raw logs every day. We built MoPub Analytics as the analytics platform, using Druid + Imply, for our end users: publishers, demand-side partners, and internal users.
We will talk about the architecture of the analytics platform, our Druid cluster setup, hardware choices, monitoring, use cases, limiting factors, challenges with lookups and solutions we used.
Watch video: https://imply.io/virtual-druid-summit/analytics-over-terabytes-of-data-at-twitter-apache-druid
Complex realtime event analytics using BigQuery @Crunch Warmup (Márton Kodok)
Complex event analytics solutions require massive architecture and know-how to build a fast real-time computing system. Google BigQuery solves this problem by enabling super-fast, SQL-like queries against append-only tables, using the processing power of Google's infrastructure. In this presentation we will see how BigQuery solves our ultimate goal: store everything, accessible by SQL, immediately, at petabyte scale. We will discuss some common use cases: funnels, user retention, affiliate metrics.
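As one hedged sketch of the retention use case, assuming an invented events(user_id, ts) schema, a week-over-week retention query might look like this:

```python
from google.cloud import bigquery

# Week-over-week retention: of users active in week N, what fraction
# returned in week N+1? Schema is invented for illustration.
sql = """
WITH weekly AS (
  SELECT user_id, TIMESTAMP_TRUNC(ts, WEEK) AS week
  FROM my_dataset.events
  GROUP BY user_id, week
)
SELECT a.week,
       COUNT(DISTINCT b.user_id) / COUNT(DISTINCT a.user_id) AS retention
FROM weekly a
LEFT JOIN weekly b
  ON a.user_id = b.user_id
 AND b.week = TIMESTAMP_ADD(a.week, INTERVAL 7 DAY)
GROUP BY a.week
ORDER BY a.week
"""
for row in bigquery.Client().query(sql):
    print(row.week, f"{row.retention:.1%}")
```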
Vitalii Bondarenko - Scalable Business Analytics in a Cloud Big Data Cluster (Lviv Startup Club)
This document discusses different architectures and approaches for scalable business intelligence (BI) on cloud big data clusters. It covers OLAP approaches using relational databases, OLAP-on-Hadoop engines like Apache Kylin, big data compute engines like Hive LLAP, and column stores like Druid. It also discusses in-memory aggregations with MemSQL and the results of testing the different solutions. Traditional BI approaches on SQL servers are discussed alongside modern approaches using Hadoop, Spark, and cloud services from AWS, Azure, and Google Cloud.
The integration between the Spring Framework and MongoDB tends to be somewhat unknown. This presentation shows the different projects that compose the Spring ecosystem (Spring Data, Spring Boot, Spring IO, etc.) and how to move from pure Java projects to massive enterprise systems that require these pieces to work together.
Azure DocumentDB for Healthcare Integration (BizTalk360)
This document provides an overview of using Azure DocumentDB as an HL7 document repository for healthcare integration. It discusses DocumentDB features like JSON documents, indexing, CRUD operations, and querying. Example use cases for an HL7 document repository using DocumentDB are presented, including personal health records, document sharing, decision support, and patient demographics. The document concludes by previewing the design of an Azure API connector app for DocumentDB and a Logic App for HL7 FHIR.
At the Heart of the Elastic Stack Roadmap (Elasticsearch)
Discover the latest features through our demos and announcements: cross-cluster replication, Elasticsearch frozen indices, Kibana Spaces, and ever more data integrations in Beats and Logstash.
Seeing at the Speed of Thought: Empowering Others Through Data Exploration (Greg Goltsov)
This slide deck is about empowering others through data exploration. The presentation discusses removing barriers to data, making feedback fast, and removing yourself as a blocker for others. It emphasizes visualizing data pipelines and augmenting data warehouses with data lakes to handle varying data volumes, varieties, and velocities. The goal is to turn data into insights that create business value.
How to manage one million messages per second using Azure, Radu Vunvulea, ITD... (Radu Vunvulea)
At the beginning of a project it is easy to promise clients many things, but when you have to deliver on those promises you may discover it is impossible. Living in the IoT era, we need to be able to process large amounts of content per second. In this session we will see how we can construct a solution around Azure that can easily handle 1M messages per second. We will start the session with a real-time demo and then describe how such a system can be built from Azure services in less than 8 hours. Each Azure component used in the demo will be described in detail, along with its pros and cons.
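The abstract doesn't name the specific Azure services; assuming Azure Event Hubs as the ingestion tier, a minimal producer sketch with the azure-eventhub SDK could look like this (connection string and hub name are placeholders):

```python
from azure.eventhub import EventHubProducerClient, EventData

# Placeholders -- the talk builds a full pipeline, this is only ingestion.
producer = EventHubProducerClient.from_connection_string(
    conn_str="Endpoint=sb://...;SharedAccessKeyName=...;SharedAccessKey=...",
    eventhub_name="telemetry",
)

with producer:
    batch = producer.create_batch()  # batching is what makes high rates feasible
    for i in range(500):
        batch.add(EventData(f'{{"device": {i}, "reading": 42}}'))
    producer.send_batch(batch)       # one network call for the whole batch
```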
How to get the best of both: MongoDB is great for low-latency, quick access to recent data; Treasure Data is great for an infinitely growing store of historical data. In the latter case, one need not worry about scaling.
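One common way to keep the MongoDB side holding only the recent, low-latency slice is a TTL index, with the full history flowing to the long-term store; a sketch with pymongo (collection and field names invented):

```python
from datetime import datetime, timezone
from pymongo import MongoClient

events = MongoClient("mongodb://localhost:27017")["app"]["recent_events"]

# TTL index: MongoDB expires documents ~30 days after their `ts` value,
# so the collection stays small and fast while the full history lives
# in the long-term store (Treasure Data, in the article's setup).
events.create_index("ts", expireAfterSeconds=30 * 24 * 3600)

events.insert_one({"user": "u1", "action": "login",
                   "ts": datetime.now(timezone.utc)})
```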
SQL Server Reporting Services Disaster Recovery Webinar (Denny Lee)
This is the PASS DW/BI Webinar for SQL Server Reporting Services (SSRS) Disaster Recovery webinar. You can find the video at: http://www.youtube.com/watch?v=gfT9ETyLRlA
Denny Lee introduced Azure DocumentDB, a fully managed NoSQL database service. DocumentDB provides elastic scaling of throughput and storage, global distribution with low latency reads and writes, and supports querying JSON documents with SQL and JavaScript. Common scenarios that benefit from DocumentDB include storing product catalogs, user profiles, sensor telemetry, and social graphs due to its ability to handle hierarchical and de-normalized data at massive scale.
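The "querying JSON documents with SQL" point looks roughly like this with the current azure-cosmos Python SDK, under which DocumentDB now ships as Cosmos DB (endpoint, key, and names are placeholders):

```python
from azure.cosmos import CosmosClient

client = CosmosClient("https://myaccount.documents.azure.com:443/", "<key>")
container = client.get_database_client("catalog").get_container_client("products")

# SQL-like query over schema-free JSON documents.
items = container.query_items(
    query="SELECT c.id, c.price FROM c WHERE c.category = @cat",
    parameters=[{"name": "@cat", "value": "sensors"}],
    enable_cross_partition_query=True,
)
for item in items:
    print(item["id"], item["price"])
```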
SQL Server Integration Services Best Practices (Denny Lee)
This is Thomas Kejser's and my presentation at the Microsoft Business Intelligence Conference 2008 (October 2008) on SQL Server Integration Services Best Practices.
This is an excerpt of the "Tier-1 BI in the World of Big Data" by Thomas Kejser, Denny Lee, and Kenneth Lieu specific to the Yahoo! TAO Case Study published at: http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=710000001707
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da... (MongoDB)
Drawn from Think Big's experience on real-world client projects, Think Big Academy Director and Principal Architect Jeffrey Breen will review specific ways to integrate NoSQL databases into Hadoop-based Big Data systems: preserving state in otherwise stateless processes; storing pre-computed metrics and aggregates to enable interactive analytics and reporting; and building a secondary index to provide low-latency, random access to data stored on high-latency HDFS. A working example of secondary indexing is presented in which MongoDB is used to index web site visitor locations from Omniture clickstream data stored on HDFS.
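The talk's working example isn't reproduced here, but the shape of such a secondary index is small enough to sketch with pymongo, assuming documents that map a visitor key to where the raw clickstream lives on HDFS (all names hypothetical):

```python
from pymongo import ASCENDING, MongoClient

idx = MongoClient("mongodb://localhost:27017")["analytics"]["visitor_index"]

# Each document points from a low-latency lookup key to the raw data's
# location on high-latency HDFS (file + offset), as in the Omniture example.
idx.insert_one({
    "visitor_id": "v-123",
    "geo": {"country": "US", "city": "Portland"},
    "hdfs_path": "/data/omniture/2013/07/14/clicks_0042.tsv",
    "byte_offset": 1048576,
})
idx.create_index([("geo.country", ASCENDING), ("geo.city", ASCENDING)])

# Random access: find where a visitor's raw events live without scanning HDFS.
print(idx.find_one({"visitor_id": "v-123"})["hdfs_path"])
```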
The document discusses modernizing a data warehouse using the Microsoft Analytics Platform System (APS). APS is described as a turnkey appliance that allows organizations to integrate relational and non-relational data in a single system for enterprise-ready querying and business intelligence. It provides a scalable solution for growing data volumes and types that removes limitations of traditional data warehousing approaches.
Big Data Analytics in the Cloud with Microsoft Azure (Mark Kromer)
Big Data Analytics in the Cloud using Microsoft Azure services was discussed. Key points included:
1) Azure provides tools for collecting, processing, analyzing and visualizing big data including Azure Data Lake, HDInsight, Data Factory, Machine Learning, and Power BI. These services can be used to build solutions for common big data use cases and architectures.
2) U-SQL is a language for preparing, transforming and analyzing data that allows users to focus on the what rather than the how of problems. It uses SQL and C# and can operate on structured and unstructured data.
3) Visual Studio provides an integrated environment for authoring, debugging, and monitoring U-SQL scripts and jobs. This allows
Azure Data Factory is a cloud-based data integration service that orchestrates and automates the movement and transformation of data. In this session we will learn how to create data integration solutions using the Data Factory service and ingest data from various data stores, transform/process the data, and publish the result data to the data stores.
QuerySurge Slide Deck for Big Data Testing Webinar (RTTS)
This is a slide deck from QuerySurge's Big Data Testing webinar.
Learn why testing is pivotal to the success of your Big Data strategy.
Learn more at www.querysurge.com
The growing variety of new data sources is pushing organizations to look for streamlined ways to manage complexities and get the most out of their data-related investments. The companies that do this correctly are realizing the power of big data for business expansion and growth.
Learn why testing your enterprise's data is pivotal for success with big data, Hadoop and NoSQL. Learn how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data warehouse - all with one ETL testing tool.
This information is geared towards:
- Big Data & Data Warehouse Architects,
- ETL Developers
- ETL Testers, Big Data Testers
- Data Analysts
- Operations teams
- Business Intelligence (BI) Architects
- Data Management Officers & Directors
You will learn how to:
- Improve your Data Quality
- Accelerate your data testing cycles
- Reduce your costs & risks
- Provide a huge ROI (as high as 1,300%)
Introduction to Microsoft Azure HDInsight by Dattatrey Sindhol (HARMAN Services)
This document provides an introduction to Microsoft Azure HDInsight, including:
- An overview of HDInsight and how it is Microsoft's Hadoop distribution running in the cloud based on Hortonworks Data Platform.
- The architecture of HDInsight and how it is tightly integrated with Microsoft's technology stack.
- Examples of use cases for HDInsight like iterative data exploration, data warehousing on demand, and ETL automation.
Best practices to deliver data analytics to the business with Power BI (Satya Shyam K Jayanty)
Bring your data to life with Power BI visualization and insights!
With the ever-changing landscape of Power BI features, it is essential to adopt configuration and deployment practices within your data platform that keep you on par with compliance & security requirements. In this session we will go from the basics to advanced tricks across this landscape:
How to deploy Power BI?
How to implement configuration parameters and package BI features as part of an Office 365 roll-out in your organisation?
What are the newest features and enhancements in the Power BI landscape?
How to manage on-premises vs. on-cloud connectivity?
How can you help and support the Power BI community as well?
Within the objectives of this session we will also look at cloud computing, which makes it possible to get data to the end user within a few clicks. We will review how to manage and connect on-premises data to cloud capabilities, taking full advantage of data catalogue features while keeping data secure per Information Governance standards. Beyond the nuts and bolts, performance is something every admin has to keep up with, so we will look at a few settings for maximizing performance and optimizing access to data. You will gain understanding of, and insight into, a number of tools available for your Business Intelligence needs, with a showcase of demos on where to begin and how to proceed in the BI world.
- D BI A Consulting
consulting@dbia.uk
This document provides an overview of how to build your own personalized search and discovery tool like Microsoft Delve by combining machine learning, big data, and SharePoint. It discusses the Office Graph and how signals across Office 365 are used to populate insights. It also covers big data concepts like Hadoop and machine learning algorithms. Finally, it proposes a high-level architectural concept for building a Delve-like tool using Azure SQL Database, Azure Storage, Azure Machine Learning, and presenting insights.
How to build your own Delve: combining machine learning, big data and SharePoint (Joris Poelmans)
You experience the benefits of machine learning every day through product recommendations on Amazon & Bol.com, credit card fraud prevention, etc. So how can we leverage machine learning together with SharePoint and Yammer? We will first look into the fundamentals of machine learning and big data solutions, and next we will explore how we can combine tools such as Windows Azure HDInsight, R, and Azure Machine Learning to extend and support collaboration and content management scenarios within your organization.
This document discusses using Azure HDInsight for big data applications. It provides an overview of HDInsight and describes how it can be used for various big data scenarios like modern data warehousing, advanced analytics, and IoT. It also discusses the architecture and components of HDInsight, how to create and manage HDInsight clusters, and how HDInsight integrates with other Azure services for big data and analytics workloads.
Building an Enterprise-Scale Dashboarding/Analytics Platform Powered by the C... (Imply)
Target is one of the largest retailers in the United States, with brick-and-mortar stores in all 50 states and one of the most-visited ecommerce sites in the country. In addition to typical merchandising functions like assortment planning, pricing and inventory management, Target also operates a large supply chain, financial/banking operations and property management organizations. As a data-driven organization, we need a data analytics platform that can address the unique needs of each of these various business units, while scaling to hundreds of thousands of users and accommodating an ever-increasing amount of data.
In this talk we’ll cover why Target chose to create our own analytics platform and specifically how Druid makes this platform successful. We’ll cover how we utilize key features in Druid, such as union datasources, arbitrary granularities, real-time ingestion, complex aggregation expressions and lightning-fast query response to provide analytics to users at all levels of the organization. We’ll also cover how Druid’s speed and flexibility allow us to provide interactive analytics to front-line, edge-of-business consumers to address hundreds of unique use-cases across several business units.
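As a flavour of the "arbitrary granularities" feature mentioned above, Druid SQL lets each query pick its own time bucket at read time; a hedged sketch (datasource, columns, and broker address invented):

```python
import requests

DRUID_SQL = "http://druid-broker:8082/druid/v2/sql"  # assumed broker address

# Bucket sales events by a granularity chosen per query -- here
# 15-minute windows -- rather than a fixed, pre-aggregated rollup.
query = """
SELECT TIME_FLOOR(__time, 'PT15M') AS bucket, SUM(amount) AS sales
FROM store_sales
GROUP BY 1
ORDER BY 1
"""
for row in requests.post(DRUID_SQL, json={"query": query}).json():
    print(row["bucket"], row["sales"])
```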
This document provides an overview of big data fundamentals and considerations for setting up a big data practice. It discusses key big data concepts like the four V's of big data. It also outlines common big data questions around business context, architecture, skills, and presents sample reference architectures. The document recommends starting a big data practice by identifying use cases, gaining management commitment, and setting up a center of excellence. It provides an example use case of retail web log analysis and presents big data architecture patterns.
Real-time big data analytics based on product recommendations case study (deep.bi)
We started as an ad network. The challenge was to recommend the best product (out of millions) to the right person in a given moment (thousands of users within a second). We have delivered 5 billion ad views over the past 24 months. To put that in context: if we served 1 ad per second, it would take about 160 years to deliver 5 billion ads.
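The 160-year figure is simple arithmetic:

```latex
\frac{5 \times 10^{9}\ \text{ads}}{1\ \text{ad/s}}
  = 5 \times 10^{9}\ \text{s}
  \approx \frac{5 \times 10^{9}}{3.156 \times 10^{7}\ \text{s/year}}
  \approx 158\ \text{years}
```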
So we needed a solution. SQL databases did not work. Popular NoSQL databases did not work. Standard data warehouse approaches (pre-aggregation, creating schemas) did not work either.
Re-thinking all the problems posed by the huge data streams flowing to us every second, we built a complete solution based on open-source technologies and fresh, smart ideas from our engineering team. It is called deep.bi, and now we make it available to other companies.
deep.bi lets high-growth companies solve fast data problems by providing scalable, flexible and real-time data collection, enrichment and analytics.
It was built using:
- Node.js - API
- Kafka - collecting and distributing data
- Spark Streaming - ETL, data enrichments
- Druid - real-time analytics
- Cassandra - user events store
- Hadoop + Parquet + Spark - raw data store + ad-hoc queries
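As a hedged sketch of the Kafka stage in the list above (topic name and event shape invented), the collection tier might push events like this with kafka-python:

```python
import json
from kafka import KafkaProducer

# Collecting stage of the pipeline: the API pushes raw events onto Kafka,
# from which Spark Streaming and Druid consume downstream.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda e: json.dumps(e).encode("utf-8"),
)

producer.send("user-events", {"user_id": "u42", "type": "ad_view",
                              "product": "p-981", "ts": 1496150400})
producer.flush()  # block until the event is actually delivered
```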
Big Data Analytics from Azure Cloud to Power BI Mobile (Roy Kim)
This document discusses using Azure services for big data analytics and data insights. It provides an overview of Azure services like Azure Batch, Azure Data Lake, Azure HDInsight and Power BI. It then describes a demo solution that uses these Azure services to analyze job posting data, including collecting data using a .NET application, storing in Azure Data Lake Store, processing with Azure Data Lake Analytics and Azure HDInsight, and visualizing results in Power BI. The presentation includes architecture diagrams and discusses implementation details.
This document discusses the emergence of logical data warehouses and how they can help organizations address challenges posed by big data. A logical data warehouse takes a virtualized approach to integrating data from multiple sources like relational databases, NoSQL stores, and file systems. It provides a single, unified view of data while keeping the underlying systems decoupled. The document also describes how organizations can use techniques like data virtualization and offloading to optimize workloads between their enterprise data warehouse and Hadoop data lake. This helps reduce costs while improving query performance and resource utilization.
This is a run-through at a 200 level of the Microsoft Azure Big Data Analytics for the Cloud data platform based on the Cortana Intelligence Suite offerings.
So you've got a handle on what Big Data is and how you can use it to find business value in your data. Now you need an understanding of the Microsoft products that can be used to create a Big Data solution. Microsoft has many pieces of the puzzle, and in this presentation I will show how they fit together. How does Microsoft enhance and add value to Big Data? From collecting data, transforming it, and storing it, to visualizing it, I will show you Microsoft's solutions for every step of the way.
DEVNET-1140 InterCloud MapReduce and Spark Workload Migration and Sharing: Fi... (Cisco DevNet)
Data gravity is a reality when dealing with massive amounts of data and globally distributed systems. Processing this data requires distributed analytics processing across the InterCloud. In this presentation we will share our real-world experience with storing, routing, and processing big data workloads on Cisco Cloud Services and Amazon Web Services clouds.
Testing Big Data: Automated Testing of Hadoop with QuerySurge (RTTS)
Are You Ready? Stepping Up To The Big Data Challenge In 2016 - Learn why Testing is pivotal to the success of your Big Data Strategy.
According to a new report by analyst firm IDG, 70% of enterprises have either deployed or are planning to deploy big data projects and programs this year due to the increase in the amount of data they need to manage.
The growing variety of new data sources is pushing organizations to look for streamlined ways to manage complexities and get the most out of their data-related investments. The companies that do this correctly are realizing the power of big data for business expansion and growth.
Learn why testing your enterprise's data is pivotal for success with big data and Hadoop. Learn how to increase your testing speed, boost your testing coverage (up to 100%), and improve the level of quality within your data - all with one data testing tool.
Azure Cosmos DB: Globally Distributed Multi-Model Database Service (Denny Lee)
Azure Cosmos DB is the industry's first globally distributed multi-model database service. Features of Cosmos DB include turn-key global distribution, elastic throughput and storage, multiple consistency models, and financially backed SLAs. As well, we are in preview for Table, Graph, and Spark Connector to Cosmos DB. Also includes healthcare scenarios!
SQL Server Reporting Services: IT Best Practices (Denny Lee)
This is Lukasz Pawlowski's and my presentation at the Microsoft Business Intelligence Conference 2008 (October 2008) on SQL Server Reporting Services: IT Best Practices.
Introduction to Microsoft's Big Data Platform and Hadoop Primer (Denny Lee)
This is my 24 Hours of SQL PASS (September 2012) presentation on Introduction to Microsoft's Big Data Platform and Hadoop Primer, also known as Project Isotope and HDInsight.
Differential Privacy Case Studies (CMU-MSR Mindswap on Privacy 2007) (Denny Lee)
This document discusses case studies using differential privacy to analyze sensitive data. It describes analyzing Windows Live user data to study web analytics and customer churn. Clinical researchers' perspectives on differential privacy were also examined. Researchers wanted unaffected statistics and the ability to access original data if needed. Future collaboration with OHSU aims to develop a healthcare template for applying differential privacy.
SQL Server Reporting Services Disaster Recovery webinar (Denny Lee)
This is the PASS DW|BI virtual chapter webinar on SQL Server Reporting Services Disaster Recovery with Ayad Shammout and myself - hosted by Julie Koesmarno (@mssqlgirl)
Building and Deploying Large Scale SSRS using Lessons Learned from Customer D... (Denny Lee)
This document discusses lessons learned from deploying large scale SQL Server Reporting Services (SSRS) environments based on customer scenarios. It covers the key aspects of success, scaling out the architecture, performance optimization, and troubleshooting. Scaling out involves moving report catalogs to dedicated servers and using a scale out deployment architecture. Performance is optimized through configurations like disabling report history and tuning memory settings. Troubleshooting utilizes logs, monitoring, and diagnosing issues like out of memory errors.
Designing, Building, and Maintaining Large Cubes using Lessons Learned (Denny Lee)
This is Nicholas Dritsas, Eric Jacobsen, and my 2007 SQL PASS Summit presentation on designing, building, and maintaining large Analysis Services cubes
SQLCAT: A Preview to PowerPivot Server Best Practices (Denny Lee)
The document discusses SQL Server Customer Advisory Team (SQLCAT) and their work on the largest and most complex SQL Server projects worldwide. It also discusses SQLCAT's sharing of technical content and driving of product requirements back into SQL Server based on customer needs. The document promotes an upcoming SQL Server Clinic where experts will be available to answer questions about architecting and designing future applications.
SQLCAT: Tier-1 BI in the World of Big Data (Denny Lee)
This document summarizes a presentation on tier-1 business intelligence (BI) in the world of big data. The presentation will cover Microsoft's BI capabilities at large scales, big data workloads from Yahoo and investment banks, Hadoop and the MapReduce framework, and extracting data out of big data systems into BI tools. It also shares a case study on Yahoo's advertising analytics platform that processes billions of rows daily from terabytes of data.
Jump Start into Apache Spark (Seattle Spark Meetup) (Denny Lee)
Denny Lee, Technology Evangelist with Databricks, will demonstrate how easily many Data Science and Big Data (and many not-so-Big Data) scenarios can be tackled using Apache Spark. This introductory-level jump start will focus on user scenarios; it will be demo-heavy and slide-light!
How Concur uses Big Data to get you to Tableau Conference On Time (Denny Lee)
This is my presentation from Tableau Conference #Data14 as the Cloudera Customer Showcase - How Concur uses Big Data to get you to Tableau Conference On Time. We discuss Hadoop, Hive, Impala, and Spark within the context of Consolidation, Visualization, Insight, and Recommendation.
How Klout is changing the landscape of social media with Hadoop and BI (Denny Lee)
Updated from the Hadoop Summit slides (http://www.slideshare.net/Hadoop_Summit/klout-changing-landscape-of-social-media), we've included additional screenshots to help tell the whole story.
A primer on PowerPivot topology and configurations (Denny Lee)
PowerPivot allows Excel workbooks containing large datasets to be hosted and rendered in SharePoint. It uses an in-memory database called VertiPaq to store and query data. The PowerPivot add-in allows connecting to various data sources from within an Excel workbook. When hosted in SharePoint, Excel Services renders the workbook, while the PowerPivot system service manages data refresh and usage analytics. PowerPivot supports on-premises SharePoint farms with load balancing and high availability of PowerPivot system components across multiple application servers.
Webinar: Designing a schema for a Data Warehouse (Federico Razzoli)
Are you new to data warehouses (DWH)? Do you need to check whether your data warehouse follows the best practices for a good design? In both cases, this webinar is for you.
A data warehouse is a central relational database that contains all measurements about a business or an organisation. This data comes from a variety of heterogeneous data sources, which includes databases of any type that back the applications used by the company, data files exported by some applications, or APIs provided by internal or external services.
But designing a data warehouse correctly is a hard task, which requires gathering information about the business processes that need to be analysed in the first place. These processes must be translated into so-called star schemas: denormalised databases where each table represents a dimension or facts.
We will discuss these topics:
- How to gather information about a business;
- Understanding dictionaries and how to identify business entities;
- Dimensions and facts;
- Setting a table granularity;
- Types of facts;
- Types of dimensions;
- Snowflakes and how to avoid them;
- Expanding existing dimensions and facts.
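As a tiny illustration of the star schema idea described above, one fact table surrounded by denormalised dimensions (table and column names invented), using SQLite so the sketch is self-contained:

```python
import sqlite3

db = sqlite3.connect(":memory:")

# One fact table at a chosen granularity (one row per order line),
# each foreign key pointing at a denormalised dimension.
db.executescript("""
CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT);

CREATE TABLE fact_sales (
    date_id     INTEGER REFERENCES dim_date(date_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    quantity    INTEGER,   -- additive measures live on the fact table
    amount      REAL
);
""")

# A typical analysis: join the fact to its dimensions and aggregate.
for year, category, total in db.execute("""
    SELECT d.year, p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d    ON d.date_id = f.date_id
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY d.year, p.category
"""):
    print(year, category, total)
```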
GraphRAG for Life Science to increase LLM accuracy (Tomaz Bratanic)
GraphRAG for the life science domain, where you retrieve information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers.
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack (shyamraj55)
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
UiPath Test Automation using UiPath Test Suite series, part 6 (DianaGray10)
Welcome to the UiPath Test Automation using UiPath Test Suite series, part 6. In this session, we will cover test automation with generative AI and OpenAI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and OpenAI
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Your One-Stop Shop for Python Success: Top 10 US Python Development Providers (akankshawande)
Simplify your search for a reliable Python development partner! This list presents the top 10 trusted US providers offering comprehensive Python development services, ensuring your project's success from conception to completion.
TrustArc Webinar - 2024 Global Privacy Survey (TrustArc)
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
Main news related to the CCS TSI 2023 (2023/1695) (Jakub Marek)
An English 🇬🇧 translation of the presentation for the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on communications and signalling systems on railways, held at the Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). It was attended by around 500 participants and 200 online followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
Programming Foundation Models with DSPy - Meetup Slides (Zilliz)
Prompting language models is hard, while programming language models is easy. In this talk, I will discuss the state-of-the-art framework DSPy for programming foundation models with its powerful optimizers and runtime constraint system.
Skybuffer SAM4U tool for SAP license adoption (Tatiana Kojar)
Manage and optimize your license adoption and consumption with SAM4U, a free SAP customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
Fueling AI with Great Data with Airbyte Webinar (Zilliz)
This talk will focus on how to collect data from a variety of sources, leveraging this data for RAG and other GenAI use cases, and finally charting your course to productionalization.
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A... (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on integration of Salesforce with Bonterra Impact Management.
Interested in deploying an integration with Salesforce for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
5th LF Energy Power Grid Model Meet-up Slides (DanBrown980551)
5th Power Grid Model Meet-up
It is with great pleasure that we extend to you an invitation to the 5th Power Grid Model Meet-up, scheduled for 6th June 2024. This event will adopt a hybrid format, allowing participants to join us either through an online Microsoft Teams session or in person at TU/e, located at Den Dolech 2, Eindhoven, Netherlands. The meet-up will be hosted by Eindhoven University of Technology (TU/e), a research university specializing in engineering science & technology.
Power Grid Model
The global energy transition is placing new and unprecedented demands on Distribution System Operators (DSOs). Alongside upgrades to grid capacity, processes such as digitization, capacity optimization, and congestion management are becoming vital for delivering reliable services.
Power Grid Model is an open source project from Linux Foundation Energy and provides a calculation engine that is increasingly essential for DSOs. It offers a standards-based foundation enabling real-time power systems analysis, simulations of electrical power grids, and sophisticated what-if analysis. In addition, it enables in-depth studies and analysis of the electrical power grid’s behavior and performance. This comprehensive model incorporates essential factors such as power generation capacity, electrical losses, voltage levels, power flows, and system stability.
Power Grid Model is currently being applied in a wide variety of use cases, including grid planning, expansion, reliability, and congestion studies. It can also help in analyzing the impact of renewable energy integration, assessing the effects of disturbances or faults, and developing strategies for grid control and optimization.
What to expect
For the upcoming meetup we are organizing, we have an exciting lineup of activities planned:
-Insightful presentations covering two practical applications of the Power Grid Model.
-An update on the latest advancements in Power Grid Model technology during the first and second quarters of 2024.
-An interactive brainstorming session to discuss and propose new feature requests.
-An opportunity to connect with fellow Power Grid Model enthusiasts and users.
HCL Notes and Domino License Cost Reduction in the World of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your costs through an optimized configuration and keep them low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
Generating privacy-protected synthetic data using Secludy and MilvusZilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
Best 20 SEO Techniques To Improve Website Visibility In SERPPixlogix Infotech
Boost your website's visibility with proven SEO techniques! Our latest blog dives into essential strategies to enhance your online presence, increase traffic, and rank higher on search engines. From keyword optimization to quality content creation, learn how to make your site stand out in the crowded digital landscape. Discover actionable tips and expert insights to elevate your SEO game.
Digital Marketing Trends in 2024 | Guide for Staying AheadWask
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
Big Data, Bigger Brains
1. November 6-9, Seattle, WA
Big Data, Bigger Brains
How Klout Changed the Landscape of Social Media with Hadoop and BI
Denny Lee, Microsoft
Dave Mariani, Klout
3. Klout’s Big Data makes all this possible
15 Social Networks Processed Every Day
769 Terabytes of Data Storage
200,000 Indexed Users Added Every Day
140,000,000 Users Indexed Every Day
12,000,000,000 Social Signals Processed Every Day
30,000,000,000 API Calls Delivered Every Month
1,080,000,000,000 Rows of Data In Data Warehouse
6. What is Business Intelligence?
• Data Warehousing, OLAP, Dashboards, Reporting
• Ability to slice and dice data in an ad-hoc manner
• Getting the right data to the right people, at the right time (i.e. now)
7. Why Hadoop + BI?

Requirement                                  Hadoop & Hive   BI Query Engines
Capture & store all data                     Yes             No
Support queries against detail data          Yes             No
Support interactive queries & applications   No              Yes
Support BI & visualization tools             No              Yes
8. An Example: Klout Event Tracker
1 Perform A/B Testing of User Flows
2 Optimize Registration Funnels
3 Monitor consumer engagement & retention (DAUs & MAUs); see the sketch below
4 Flexibly track and report on user generated events
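As a hedged illustration of item 3 (not from the original deck), a daily-active-user count can be expressed directly against the event_log Hive table that appears later in these slides; it assumes ks_uid identifies a unique user and dt partitions the log by day:

-- Sketch: daily active users per project over the event_log table
-- (assumes ks_uid uniquely identifies a user; dt is the day partition)
SELECT dt, project, COUNT(DISTINCT ks_uid) AS dau
FROM bi.event_log
GROUP BY dt, project;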
9. A Flexible, Hierarchical Schema

Project: a collection of Events (e.g. HomePage, Actions, Mobile iOS)
Event: a captured user action (e.g. the +K "Add a topic" event)
Property Type: an attribute key (e.g. Source, Gender, Location)
Property Value: an attribute value (e.g. Google Search, Male, SF)
10. Event Tracker Architecture

Pipeline stages: Instrument → Collect → Persist → Query → Report
• Instrument: Tracker API (Scala, node.JS)
• Collect: Log Process (Flume)
• Persist: Warehouse
• Query: Cube (Analysis Services)
• Report: Klout UI (Scala, AJAX UX)
SELECT { [Measures].[Counter], [Measures].[PreviousPeriodCounter] }
ON COLUMNS,
NON EMPTY CROSSJOIN (
    exists([Date].[Date].[Date].allmembers,
           [Date].[Date].&[2012-05-19T00:00:00]:[Date].[Date].&[2012-06-02T00:00:00]),
    [Events].[Event].[Event].allmembers
) DIMENSION PROPERTIES MEMBER_CAPTION
ON ROWS
FROM [ProductInsight]
WHERE ({[Projects].[Project].[plusK]})
event_log
    tstamp       string
    project      string
    event        string
    session_id   bigint
    ks_uid       bigint
    ip           string
    json_keys    array<string>
    json_values  array<string>
    json_text    string
    dt           string
    hr           string
{
    "project": "plusK",
    "event": "spend",
    "session_id": "0",
    "ip": "50.68.47.158",
    "kloutId": "123456",
    "cookie_id": "123456",
    "ref": "http://klout.com/",
    "type": "add_topic",
    "time": "1338366015"
}
will be saved in HDFS at:
/logs/events_tracking/2012-05-30/0100

An incoming tracking call looks like:
insights3:9003/track/{"project":"plusK","event":"spend","ks_uid":123456,"type":"add_topic"}
14. Linked Server Connection

EXEC master.dbo.sp_addlinkedserver
    @server = N'HiveDW', @srvproduct = N'HIVE',
    @provider = N'MSDASQL', @datasrc = N'Hive DW',
    @provstr = N'Provider=MSDASQL.1;Persist Security Info=True;User ID=SSAS;Password=P@assw0rd;';

CREATE VIEW vw_tbl_HiveSample AS
SELECT * FROM OpenQuery(HiveDW, 'SELECT * FROM HiveSampleTable;');
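A hedged usage sketch (not in the original deck): once the view exists, the Hive data can be queried like any local object and even joined to native SQL Server tables. dbo.DimUser and its ks_uid column are hypothetical placeholders for illustration:

-- Query the Hive-backed view as if it were a local table
SELECT TOP 10 * FROM vw_tbl_HiveSample;

-- Hypothetical heterogeneous join against a local SQL Server table
SELECT u.UserName, h.*
FROM vw_tbl_HiveSample AS h
JOIN dbo.DimUser AS u
    ON u.ks_uid = h.ks_uid;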
15. OpenQuery / Linked Server Thoughts
• On paper, this probably shouldn't work
• Yet it works, and it has been running in production for three months now
• It's more stable than the original MySQL connection
• Has the advantage that, from the cube's perspective, the DSV "looks like" it's connecting to a SQL Server
16. Demo: Hadoop to BI Workflow
• Custom BI Application
• Commercial BI Tool
• Command Line
• HiveQL in Excel
17. Hadoop & BI Together: Query Cube using a Custom App
21. HiveQL Example
SELECT
get_json_object(json_text,'$.sid') as sid,
get_json_object(json_text,'$.kloutId') as kloutId,
get_json_object(json_text,'$.v') as version,
get_json_object(json_text,'$.status') as status,
event
FROM bi.event_log
WHERE project='mobile-ios'
AND dt=20121027
AND event in ('api_error', 'api_timeout')
ORDER BY sid;
22. Hadoop & BI Together: Query Hive using Excel
23. Hadoop & BI Together: Digging deeper on influence
25. So let's dig in!

select * from openquery(hiveprodbi,
    'select * from bi_maxwell.actor_action
     where network_abbr = ''gp''
       and ks_uid = 1711
       and actor_ks_uid = 477358')

It returns this data:

ks_uid:              1711
service_uid:         103493459351957813291
service_id:          13
tstamp:              1345073319000
message_id:          NULL
action:              GOOGLE_PLUS_PLUSONES
actor_service_uid:   100525279016049609152
tstamp_type:         ORIGINAL_CONTENT_CREATION
original_message_id: z12djbti1oelu1eff22lztphnqapsd25t04
original_tstamp:     1345073319000
registered_flag:     1
actor_ks_uid:        477358
26. Digging in further…

Took the original_message_id field and looked it up on Google+'s API:
• Go to https://developers.google.com/+/api/latest/activities/get
• Enter the content_id of 'z12djbti1oelu1eff22lztphnqapsd25t04' into the 'activityId' field in the form
• It will return this URL in the payload: https://plus.google.com/+CaliLewis/posts/JxAKBWXy3zZ
28. Why Hadoop + BI?

Requirement                                  Hadoop & Hive   BI Query Engines
Capture & store all data                     Yes             No
Support queries against detail data          Yes             No
Support interactive queries & applications   No              Yes
Support BI & visualization tools             No              Yes
29. Best Practices and Lessons Learned
• Avoid using a traditional database management system for staging purposes
• Use the SQL Server OpenQuery interface for heterogeneous joins
• Leverage Hive user-defined functions (UDFs) to transform complex data types, such as JSON, into rows and columns that Transact-SQL can understand (see the sketch below)
• Manage large dimensions by using Hive views
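As a hedged sketch of the UDF bullet (not from the deck), Hive's built-in json_tuple UDTF can flatten fields out of json_text into columns that OpenQuery, and therefore Transact-SQL, can consume; the table and fields reuse the event_log schema and the slide-21 query shown earlier:

-- Flatten JSON fields into columns with the built-in json_tuple UDTF
SELECT t.sid, t.kloutId, e.event
FROM bi.event_log e
LATERAL VIEW json_tuple(e.json_text, 'sid', 'kloutId') t AS sid, kloutId
WHERE e.project = 'mobile-ios'
  AND e.dt = '20121027';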
30. Best Practices and Lessons Learned
• Make sure Hive UDFs are permanent and visible to the ODBC provider. Options:
  • Adding them to the .hiverc file (HiveCLI only; see the sketch below)
  • Converting the UDF to a built-in function and recompiling the Hive code
  • Updating the Hive function registry to add the UDFs to the built-in function list:
    • modify the FunctionRegistry class to register the UDFs (defined in an .hql file)
    • add all the dependent jars to the hive.aux.jars.path property
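A minimal sketch of the .hiverc route mentioned above (HiveCLI only; the jar path and class name are hypothetical placeholders):

-- Lines added to ~/.hiverc so every HiveCLI session registers the UDF;
-- note this does NOT make the function visible to HiveServer/ODBC clients
ADD JAR /usr/local/hive/aux_jars/klout-udfs.jar;
CREATE TEMPORARY FUNCTION json_explode AS 'com.klout.hive.udf.JsonExplode';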
31. Best Practices and Lessons Learned
• Pad zero-length string data

SELECT
    State =
        CASE
            WHEN state = 'empty' THEN NULL
            ELSE state
        END,
    Country =
        CASE
            WHEN country = 'empty' THEN NULL
            ELSE country
        END
FROM OpenQuery(HiveDW, 'SELECT
    CASE
        WHEN LENGTH(state) = 0 THEN ''empty''
        ELSE COALESCE(state, ''empty'')
    END AS state,
    CASE
        WHEN LENGTH(country) = 0 THEN ''empty''
        ELSE COALESCE(country, ''empty'')
    END AS country
FROM HiveSampleTable')
33. November 6-9, Seattle, WA
Thank you for attending this session and the 2012 PASS Summit in Seattle
Editor's Notes
Analysis Services does not ship with a HiveQL cartridge
In Tabular mode this is possible because of the many similarities between HiveQL and SQL (provided the queries are not too complex)
Business Intelligence Development Studio and SQL Server Data Tools are unable to generate an appropriate DSV for ODBC data sources through MSDASQL
These tools use the .NET Framework Data Provider for OLE DB (System.Data.OleDb), which does not support the OLE DB Provider for ODBC
Copy this from notepad for demo:
CREATE TABLE mobile_ios_details_20120612 as
SELECT
get_json_object(json_text,'$.sid') as sid,
get_json_object(json_text,'$.inc') as inc,
get_json_object(json_text,'$.status') as status,
event
FROM bi.event_log
WHERE project='mobile-ios'
AND dt=20120612
AND get_json_object(json_text,'$.v')<>'1.5'
AND (event = 'api_error' OR event = 'api_timeout')
ORDER BY sid;
I don’t know who she is
I don’t follow her on Twitter / FB / etc
If I dig in I realize that she’s with Geek.tv so it is something I watch but still…
1. Don’t throw data away, leverage Hadoop (track users and events for a/b testing)
2. BI tools aggregate data, but we need to reach back to the detail to answer deeper questions (http codes)
3. Hadoop != interactive queries (combined proprietary data with detail)
4. Use open source, but don’t reinvent the wheel (BI tools are mature, valuable & complementary)
Leverage the best tool for the function or job