In this age of Big Data, data volumes keep growing while the technical problems and business scenarios become more complex. Compounding these complexities, data consumers are demanding faster answers to the common business questions they ask of their Big Data. This session provides concrete examples of how to address this challenge. We will highlight the use of Big Data technologies, including Hadoop and Hive, with classic BI systems such as SQL Server Analysis Services.
Session takeaways:
• Understand the architectural components surrounding Hadoop, Hive, Classic BI, and the Tier-1 BI ecosystem
• Get strategies for addressing the technical issues when working with extremely large cubes
• See how to address the technical issues when working with Big Data systems from the DBA perspective
Klout changing landscape of social media
1. How Klout is changing the landscape of social media with Hadoop and BI
Dave Mariani, VP Engineering, Klout
Denny Lee, Principal Program Manager, Microsoft
3. Klout’s Big Data makes all this possible
15 Social Networks Processed Every Day
120 Terabytes of Data Storage
200,000 Indexed Users Added Every Day
140,000,000 Users Indexed Every Day
1,000,000,000 Social Signals Processed Every Day
30,000,000,000 API Calls Delivered Every Month
54,000,000,000 Rows of Data In Klout Data Warehouse
4. Scenario and Definitions
Project: Collection of Events
Event: Captured User Action
Category: Attribute Type
Property: Event Attribute
Example: +K (Add a topic) event
• Topic: {Big Data, BI}
• Gender: {Male}
• Location: {Palo Alto}
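The definitions above can be sketched as a small data model. This is only an illustration in Scala, not Klout's actual schema: the names `EventModel`, `Event`, `Project`, `plusK`, and `countWith` are hypothetical, and representing properties as a map from category to a set of values is an assumption made for the example.

```scala
// Hypothetical data model for the slide's definitions; all names are illustrative.
object EventModel {
  // Event: a captured user action. `properties` maps each category
  // (attribute type) to the property values (event attributes) recorded for it.
  final case class Event(name: String, properties: Map[String, Set[String]])

  // Project: a named collection of events.
  final case class Project(name: String, events: Seq[Event])

  // The +K (Add a topic) example from the slide.
  val plusK: Event = Event(
    name = "+K",
    properties = Map(
      "Topic" -> Set("Big Data", "BI"),
      "Gender" -> Set("Male"),
      "Location" -> Set("Palo Alto")
    )
  )

  // Count the events in a project whose given category contains the given property.
  def countWith(project: Project, category: String, property: String): Int =
    project.events.count { e =>
      e.properties.getOrElse(category, Set.empty[String]).contains(property)
    }
}
```

With a model like this, a question such as "how many +K events carried the topic BI" becomes a filter over a project's events, which is the kind of slicing the cube later exposes as dimensions.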
5. Klout Event Tracker
1 Perform A|B Testing of User Flows
2 Optimize User Registration Funnels
3 Monitor consumer engagement & retention (DAUs & MAUs)
4 Flexibly track and report on user generated events
6. Klout Event Tracker Requirements
Requirement                                   3rd Party Web Analytics Tools   Hadoop & Hive   BI Query Engines
Capture & store all user and visitor events   No                              Yes             No
Integrate internal Klout Data                 No                              Yes             No
Support queries against granular data         No                              Yes             No
Support interactive queries                   Yes                             No              Yes
Support 3rd party BI tools                    No                              No              Yes
“Query-able” by custom apps                   No                              No              Yes
TODO: Make this look good and use animation to “blend” the last 2 columns
7. Klout Data Architecture – The Best Tool for the Job
[Architecture diagram; stage labels include Serve, Store & Enhance, and Analyze. Components shown:]
• Signal Collectors (Java/Scala)
• Data Enhancement Engine (PIG/Hive)
• Data Warehouse (Hive)
• Registrations DB (MySql)
• Profile DB (HBase)
• Search Index (Elastic Search)
• Streams (MongoDB)
• Monitoring (Nagios)
• Dashboards (Tableau)
• Analytics Cube (SSAS)
• Klout.com (Node.js)
• Klout API (Scala)
• Mobile (ObjectiveC)
• Perks Analytics (Scala)
• Event Tracker (Scala)
TO DO: Need to animate the red boxes & make this look better + add Instrument, Collect, Persist, Query, Report information
8. TO DO: make this look better +
add Instrument, Collect, Persist, Query, Report information
If possible, merge slides 7 and 8 together
9. A Peek into Product Insights > A|B Test Example for Viral Workflow
12. A Peek into Product Insights > Projects: Mobile iOS
13. Projects > Mobile iOS > Scala/JavaScript API
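// Imports assume Play Framework with Anorm, which the DB.withConnection and SQL(...) calls suggest
import anorm._
import play.api.db.DB
import play.api.Play.current

DB.withConnection("cube")(implicit conn => {
  // T-SQL wrapper that passes the MDX to the SSAS linked server through OPENQUERY.
  // Inside T-SQL bracket-quoted column names, each "]" is escaped as "]]".
  val sql1 = SQL("""
    select
      [[Date]].[Date]].[Date]].[MEMBER_CAPTION]]] AS date,
      ...
      convert(int, [[Measures]].[Counter]]]) AS cnt
    from openquery(productinsight, '
      SELECT {[Measures].[Counter]} ON COLUMNS,
      NON EMPTY CROSSJOIN (
        exists([Date].[Date].[Date].allmembers,
          {[Date].[Date].&[""" + dateFormat(past) + """]:...
      ) DIMENSION PROPERTIES MEMBER_CAPTION ON ROWS
      FROM [ProductInsight]
    ')
  """)
  // Run the query and walk the result rows
  sql1().iterator.foreach(row => {
    // process row
    val event = row[String]("event")
    // ....
  })
})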
14. Projects > Mobile iOS > Actual MDX
SELECT {
    [Measures].[Counter],
    [Measures].[PreviousPeriodCounter]
} ON COLUMNS,
NON EMPTY CROSSJOIN (
    exists([Date].[Date].[Date].allmembers,
        [Date].[Date].&[2012-05-19T00:00:00]:[Date].[Date].&[2012-06-02T00:00:00]),
    [Events].[Event].[Event].allmembers
)
DIMENSION PROPERTIES MEMBER_CAPTION ON ROWS
FROM [ProductInsight]
WHERE ({[Projects].[Project].[mobile-ios]})
16. Drilling down to the Events > HiveQL Query
-- Hive table names cannot contain hyphens, so underscores are used here
CREATE TABLE mobile_ios_details_20120530 as
SELECT
    -- pull individual fields out of the raw JSON payload
    get_json_object(json_text,'$.sid') as sid,
    get_json_object(json_text,'$.inc') as inc,
    get_json_object(json_text,'$.status') as status,
    event,
    json_text
FROM bi.event_log
WHERE project="mobile-ios"
    AND dt=20120530
    AND get_json_object(json_text,'$.v')!='1.5'
    AND (event = 'api_error' OR event = 'api_timeout')
DISTRIBUTE BY get_json_object(json_text,'$.sid')
SORT BY get_json_object(json_text,'$.sid') asc
18. Summary
• Leverage the best tool for the function or job
• Big Data != Business Intelligence
• Go open source wherever possible but use commercial software when needed