Social media are computer-mediated tools that allow people to create, share, or exchange information, ideas, pictures, and videos in virtual communities and networks. In short, social media is where your customers are, and your company needs to listen to them in order to understand them, make custom offers, or improve loyalty. Azure Stream Analytics and HDInsight can help with this. We'll focus on how to collect Twitter data using Stream Analytics, how to enrich and store that data using HDInsight, and what the challenges are in sentiment analysis using Azure Machine Learning.
7. #sqlsatistanbul
What kinds of solutions use Big Data?
• Clickstream analysis to find buying patterns
• Sentiment analysis for text data
• Fraud detection; forensic analysis
• Machine learning
• Healthcare research
• Predictive Maintenance
Just dream it. Data is everywhere!
9. Twitter launched in 2006
• Monthly active users: ~316 million (August), ~320 million (October)
• 80% of users are on mobile
• Tweets per second: ~6,000
• Tweets per day: ~500 million
• Tweets per year: ~200 billion
• Twitter generates a lot of data (12 TB per day)
• 90% of buyers trust peer recommendations
• 55% of Twitter users are female
• The average Twitter user has 27 followers
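As a rough sanity check, the volume figures above can be cross-checked with simple arithmetic. The sketch below assumes decimal terabytes (10^12 bytes) and uses the slide's rounded numbers; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the tweet volume figures quoted on the slide.
TWEETS_PER_SECOND = 6_000          # from the slide
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400
DAILY_DATA_BYTES = 12 * 10**12     # 12 TB per day, assuming decimal terabytes

tweets_per_day = TWEETS_PER_SECOND * SECONDS_PER_DAY
tweets_per_year = tweets_per_day * 365

# Average data volume per tweet, using the slide's rounded ~500 million tweets/day.
bytes_per_tweet = DAILY_DATA_BYTES / 500_000_000

print(f"tweets per day:  {tweets_per_day:,}")
print(f"tweets per year: {tweets_per_year:,}")
print(f"bytes per tweet: {bytes_per_tweet:,.0f}")
```

The results (about 518 million per day and about 189 billion per year) are in line with the rounded ~500 million and ~200 billion figures, and 12 TB per day works out to roughly 24 KB of data per tweet under these assumptions.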
13. Twitter Problems
• Event-based data
• Unstructured data
• Detailed event information
• Streaming
• Who is the influencer?
Dashboards for Tweets
• TweetTracker
• TweetArchivist
• Radian6
• Sysomos
• TweetDeck
• Hootsuite
15.
1. Collect Twitter Data & Get Simple Information
2. Data Enrichment
3. Store Semi-Structured Data
4. Analyze Semi-Structured Data
5. Visualize Meaningful Results
19.
Real-Time Analytics
Intake millions of events per second
Process data from connected devices/apps
Detect patterns and anomalies in streaming data
Transform, augment, correlate, temporal operations
No hardware (PaaS offering)
Up and running in a few clicks (and within minutes)
No performance tuning
Efficiently pay only for usage
Not paying for idle resources
Low startup costs
Scale from small to large when required
Only SQL queries needed (versus thousands of lines of code in other solutions, such as Apache Storm)
20.
Stream Analytics Query Language Functions
DML Statements
• SELECT
• FROM
• WHERE
• GROUP BY
• HAVING
• CASE
• JOIN
• UNION
Windowing Extensions
• Tumbling Window
• Hopping Window
• Sliding Window
• Duration
Aggregate Functions
• SUM
• COUNT
• AVG
• MIN
• MAX
Scaling Functions
• WITH
• PARTITION BY
Date and Time Functions
• DATENAME
• DATEPART
• DAY
• MONTH
• YEAR
• DATETIMEFROMPARTS
• DATEDIFF
• DATEADD
String Functions
• LEN
• CONCAT
• CHARINDEX
• SUBSTRING
Statistical Functions
• VAR
• VARP
• STDEV
22. Tumbling Windows
The count of tweets every 10 seconds
[Figure: timeline from 0 to 30 seconds with counts 4, 4, 5]
SELECT Topic, Count(*) AS Count
FROM sqlsaturdaystream TIMESTAMP BY CreatedAt
GROUP BY Topic, TumblingWindow(second,10)
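To make the tumbling-window semantics concrete, here is a minimal Python sketch of the same grouping done offline. It is not how Stream Analytics is implemented; the sample timestamps, the topic name, and the half-open window boundaries [end - size, end) are assumptions chosen for illustration, and the service's own boundary handling may differ.

```python
from collections import Counter


def tumbling_counts(events, size):
    """Count events per (window_end, topic) for fixed, non-overlapping windows.

    events: iterable of (timestamp_in_seconds, topic) pairs.
    Assumption: each window covers [window_end - size, window_end).
    """
    counts = Counter()
    for ts, topic in events:
        # Every event falls into exactly one window.
        window_end = (ts // size + 1) * size
        counts[(window_end, topic)] += 1
    return dict(counts)


# Hypothetical sample: 13 tweets on one topic spread over 30 seconds.
timestamps = [1, 3, 6, 8, 11, 14, 15, 19, 21, 22, 25, 27, 29]
events = [(ts, "azure") for ts in timestamps]

result = tumbling_counts(events, size=10)
for (window_end, topic), count in sorted(result.items()):
    print(window_end, topic, count)
```

Each tweet is counted once, so the three 10-second windows here hold 4, 4, and 5 tweets.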
23. Hopping Windows
Every 5 seconds, give me the count of tweets over the last 10 seconds, by topic
[Figure: timeline from 0 to 30 seconds]
SELECT Topic, Count(*) AS Count
FROM sqlsaturdaystream TIMESTAMP BY CreatedAt
GROUP BY Topic, HoppingWindow(second,10,5)
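A hopping window differs from a tumbling window in that windows overlap, so one tweet can be counted in several windows. The Python sketch below mimics HoppingWindow(second,10,5) offline; the sample data and the half-open boundaries [end - size, end) are assumptions for illustration, not the service's implementation.

```python
from collections import Counter


def hopping_counts(events, size, hop):
    """Count events per (window_end, topic) for overlapping windows.

    A window of length `size` ends at every multiple of `hop`.
    Assumption: each window covers [window_end - size, window_end).
    """
    counts = Counter()
    for ts, topic in events:
        # First window end strictly after the event time.
        window_end = (ts // hop + 1) * hop
        # The event belongs to every window that still reaches back to it.
        while window_end <= ts + size:
            counts[(window_end, topic)] += 1
            window_end += hop
    return dict(counts)


# Hypothetical sample: 13 tweets on one topic spread over 30 seconds.
timestamps = [1, 3, 6, 8, 11, 14, 15, 19, 21, 22, 25, 27, 29]
events = [(ts, "azure") for ts in timestamps]

result = hopping_counts(events, size=10, hop=5)
for (window_end, topic), count in sorted(result.items()):
    print(window_end, topic, count)
```

With a 10-second window and a 5-second hop, every tweet lands in exactly two windows, which is why the counts add up to twice the number of tweets.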
24. Sliding Windows
Find topics whose tweet count is above a threshold of 8 within the last 5 seconds
[Figure: timeline from 0 to 30 seconds]
SELECT Topic, Count(*) AS Count
FROM sqlsaturdaystream TIMESTAMP BY CreatedAt
GROUP BY Topic, SlidingWindow(second,5)
HAVING Count(*)>8
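The sliding-window query with its HAVING filter can be approximated in Python as follows. This is a simplified sketch: it only re-evaluates the count when a new tweet arrives (the real service also reacts when events leave the window), it assumes events arrive sorted by timestamp, and the sample data and function name are hypothetical.

```python
from collections import defaultdict, deque


def sliding_alerts(events, size, threshold):
    """Return (timestamp, topic, count) whenever a topic has more than
    `threshold` events within the last `size` seconds.

    events: (timestamp_in_seconds, topic) pairs, sorted by timestamp.
    Assumption: the window at time t covers (t - size, t].
    """
    recent = defaultdict(deque)  # topic -> timestamps still inside the window
    alerts = []
    for ts, topic in events:
        window = recent[topic]
        window.append(ts)
        # Drop timestamps that have slid out of the window.
        while window[0] <= ts - size:
            window.popleft()
        if len(window) > threshold:
            alerts.append((ts, topic, len(window)))
    return alerts


# Hypothetical sample: a burst of 9 "azure" tweets in 4 seconds, plus noise.
events = sorted(
    [(10 + 0.5 * i, "azure") for i in range(9)]
    + [(12.0, "hive"), (20.0, "azure")]
)

alerts = sliding_alerts(events, size=5, threshold=8)
print(alerts)
```

Only the ninth tweet of the burst pushes the 5-second count above 8, so a single alert is produced for that topic.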
27.
Data
• Local storage: upload data from PC…
• Cloud storage: Azure Storage, Azure Table, Hive, etc.
Azure Machine Learning
• ML Studio (Web IDE)
• Workspace: experiments, datasets, trained models, notebooks, access settings
• ML Web Services (REST API services)
• Azure ML Gallery (community)
• Azure Marketplace (applications store)
• Manage API
Consumers
• Excel
• Business apps
Stages: business problem, modeling, business value, deployment
Data, Model, API
30.
• SQL Server 2016 CTP 3.1
• Revolution R Open 3.2.2 for Revolution R Enterprise
• Revolution R Enterprise 7.5.0
Revolution R Enterprise is able to deliver speeds 42 times faster than competing technology from SAS.
Microsoft announced on January 23, 2015 that they had reached an agreement to purchase Revolution Analytics for an as yet undisclosed amount.
31.
• The Klout Score is a number between 1 and 100 that represents your influence.
• Klout collects and normalizes more than 12 billion signals a day
• Hive data warehouse of more than 1 trillion rows
• Klout was acquired by Lithium Technologies for $200 million
34. Hive
Developed by Facebook; later adopted by Apache as an open source project.
A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis
Integration between Hadoop and BI and visualization tools
Provides an SQL-like language called HiveQL to query data
Supports CREATE INDEX and partitioning
The common claim that UPDATE is not supported is no longer correct (Hive 0.14 added UPDATE and DELETE for transactional tables)
Hive provides users, groups, and roles, but it is not designed for high security.
Interfaces: console (hive>), scripts, ODBC/JDBC, SQuirreL, HUE, web interface, etc.
Most popular Business Intelligence Tools support Hive
35.
Data Types
Primitive Data Types: int, bigint, float, double, boolean, decimal, string, timestamp, date etc.
Complex Data Types: arrays, maps, structs
ARRAY<string>: workplace: istanbul, ankara
STRUCT<sex:string,age:int> : Female,25
MAP<string,int>: SOLR:92
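For readers more familiar with general-purpose languages, the three complex types map naturally onto a list, a record, and a dictionary. The Python sketch below is only an analogy: the column names `profile` and `skills` are hypothetical, and Hive itself stores and queries these values through HiveQL, not Python.

```python
from typing import NamedTuple


class Profile(NamedTuple):
    """Analogue of STRUCT<sex:string,age:int>: fixed, named fields."""
    sex: str
    age: int


# One row with three complex-typed columns (column names are illustrative).
row = {
    "workplace": ["istanbul", "ankara"],   # ARRAY<string>: ordered, indexable
    "profile": Profile("Female", 25),      # STRUCT: access fields by name
    "skills": {"SOLR": 92},                # MAP<string,int>: lookup by key
}

# The HiveQL equivalents would be workplace[0], profile.age, skills['SOLR'].
first_workplace = row["workplace"][0]
age = row["profile"].age
solr_score = row["skills"]["SOLR"]
print(first_workplace, age, solr_score)
```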
Hive | RDBMS
SQL interface | SQL interface
Focus on analytics | May focus on online or analytics
No transactions | Transactions usually supported
Partition adds, no random inserts | Random insert and update supported
Distributed processing via map/reduce | Distributed processing varies by vendor (if available)
Scales to hundreds of nodes | Seldom scales beyond 20 nodes
Built for commodity hardware | Often built on proprietary hardware (especially when scaling out)
Low cost per petabyte | What's a petabyte? :) (note: Are you sure?)
38. Pig
Originally developed at Yahoo! (huge contributions from Hortonworks, Twitter)
A platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs
Processes large semi-structured data sets using Hadoop MapReduce
Write complex MapReduce jobs using a simple scripting language (Pig Latin)
Pig provides a set of aggregation functions (AVG, COUNT, SUM, MAX, MIN, etc.)
Developers can write user-defined functions (UDFs)
Interfaces: console (grunt), scripts, Java, HUE (Hadoop User Experience by Cloudera)
Easy to use and efficient
39.
Data Types
Simple Data Types: int, float, double, chararray (UTF-8), bytearray
Complex Data Types: map (Key,Value), Tuple, Bag (list of tuples)
Commands
Loading: LOAD, STORE, DUMP
Filtering: FILTER, FOREACH, DISTINCT
Grouping: JOIN, GROUP, COGROUP, CROSS
Ordering: ORDER, LIMIT
Merging & Split: UNION, SPLIT
SQL script vs. Pig script
SQL: SELECT * FROM TABLE
Pig: A = LOAD 'DATA' USING PigStorage('\t') AS (col1:int, col2:int, col3:int);
SQL: SELECT col1+col2, col3 FROM TABLE
Pig: B = FOREACH A GENERATE col1+col2, col3;
SQL: SELECT col1+col2, col3 FROM TABLE WHERE col3>10
Pig: C = FILTER B BY col3>10;
SQL: SELECT col1, col2, sum(col3) FROM TABLE GROUP BY col1, col2
Pig: D = GROUP A BY (col1,col2);
Pig: E = FOREACH D GENERATE FLATTEN(group), SUM(A.col3);
SQL: ... HAVING sum(col3) > 5
Pig: F = FILTER E BY $2>5;
SQL: ... ORDER BY col1
Pig: G = ORDER F BY $0;
SQL: SELECT DISTINCT col1 FROM TABLE
Pig: I = FOREACH A GENERATE col1;
Pig: J = DISTINCT I;
SQL: SELECT col1, COUNT(DISTINCT col2) FROM TABLE GROUP BY col1
Pig: K = GROUP A BY col1;
Pig: L = FOREACH K {M = DISTINCT A.col2; GENERATE FLATTEN(group), COUNT(M);}
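The same step-by-step dataflow can be traced in plain Python, which may help when reasoning about what each Pig relation contains. This is an illustrative sketch on a tiny hypothetical data set, not a substitute for Pig: everything runs in memory on one machine, and the variable names simply echo the relation names used in the Pig script.

```python
from collections import defaultdict

# A: rows of (col1, col2, col3); hypothetical sample data.
a = [(1, 10, 3), (1, 10, 4), (1, 20, 1), (2, 10, 9), (2, 30, 2)]

# D/E: GROUP BY (col1, col2) and SUM(col3) per group.
sums = defaultdict(int)
for col1, col2, col3 in a:
    sums[(col1, col2)] += col3
e = [(col1, col2, total) for (col1, col2), total in sums.items()]

# F: HAVING sum(col3) > 5, i.e. filter on the third field ($2).
f = [row for row in e if row[2] > 5]

# G: ORDER BY col1, i.e. sort on the first field ($0).
g = sorted(f, key=lambda row: row[0])

# I/J: SELECT DISTINCT col1.
j = sorted({col1 for col1, _, _ in a})

# K/L: COUNT(DISTINCT col2) per col1.
distinct_col2 = defaultdict(set)
for col1, col2, _ in a:
    distinct_col2[col1].add(col2)
counts = sorted((col1, len(values)) for col1, values in distinct_col2.items())

print(g)
print(j)
print(counts)
```

Note how the HAVING clause becomes an ordinary filter applied after aggregation, which is exactly what the separate FILTER step does in the Pig version.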
43.
• Big Data Analytics, Implementing Big Data Analysis, Big Data Analytics with HDInsight, Big Data and Business Analytics Immersion, Getting Started with Microsoft Azure Machine Learning
• Real World Big Data in Azure, Big Data on Amazon Web Services, Reporting with MongoDB, Cloud Business Intelligence, HDInsight Deep Dive: Storm HBase and Hive, Data Science & Hadoop Workflows at Scale With Scalding, SQL on Hadoop - Analyzing Big Data with Hive
• Introduction to Big Data Analytics, Machine Learning with Big Data, Big Data Analytics for Healthcare, Data Science at Scale, The Data Scientist's Toolbox, R Programming
• Master Big Data and Hadoop Step by Step, Hadoop Essentials, Hadoop Starter Kit, Data Analytics using Hadoop eco system, Big Data: How Data Analytics Is Transforming the World, Applied Data Science with R, Hadoop Enterprise Integration
• Data Science and Analytics in Context, Introduction to Big Data with Spark, Data Science and Machine Learning Essentials, Machine Learning for Data Science and Analytics, Statistical Thinking for Data Science and Analytics