SlideShare a Scribd company logo
Marko Grobelnik 
marko.grobelnik@ijs.si 
Jozef Stefan Institute 
Ljubljana, Slovenia 
Stavanger, May 8th 2012
 
Introduction 
◦ 
What is Big data? 
◦ 
Why Big-Data? 
◦ 
When Big-Data is really a problem? 
 
Techniques 
 
Tools 
 
Applications 
 
Literature
 
‘Big-data’ is similar to ‘Small-data’, but bigger 
 
…but having data bigger consequently requires different approaches: 
◦ 
techniques, tools & architectures 
 
…to solve: 
◦ 
New problems… 
◦ 
…and old problems in a better way.
From “Understanding Big Data” by IBM
Big-Data
 
Key enablers for the growth of “Big Data” are: 
◦ 
Increase of storage capacities 
◦ 
Increase of processing power 
◦ 
Availability of data
 
NoSQL 
◦ 
DatabasesMongoDB, CouchDB, Cassandra, Redis, BigTable, Hbase, Hypertable, Voldemort, Riak, ZooKeeper 
 
MapReduce 
◦ 
Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum 
 
Storage 
◦ 
S3, Hadoop Distributed File System 
 
Servers 
◦ 
EC2, Google App Engine, Elastic, Beanstalk, Heroku 
 
Processing 
◦ 
R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, Tinkerpop
 
…when the operations on data are complex: 
◦ 
…e.g. simple counting is not a complex problem 
◦ 
Modeling and reasoning with data of different kinds can get extremely complex 
 
Good news about big-data: 
◦ 
Often, because of vast amount of data, modeling techniques can get simpler (e.g. smart counting can replace complex model based analytics)… 
◦ 
…as long as we deal with the scale
 
Research areas (such as IR, KDD, ML, NLP, SemWeb, …) are sub- cubes within the data cube 
Scalability 
Dynamicity 
Context 
Quality 
Usage
 
Good recommendations can make a big difference when keeping a user on a web site 
◦ 
…the key is how rich context model a system is using to select information for a user 
◦ 
Bad recommendations <1% users, good ones >5% users click 
Contextual personalized recommendations generated in ~20ms
 
Domain 
 
Sub-domain 
 
Page URL 
 
URL sub-directories 
 
Page Meta Tags 
 
Page Title 
 
Page Content 
 
Named Entities 
 
Has Query 
 
Referrer Query 
 
Referring Domain 
 
Referring URL 
 
Outgoing URL 
 
GeoIP Country 
 
GeoIP State 
 
GeoIP City 
 
Absolute Date 
 
Day of the Week 
 
Day period 
 
Hour of the day 
 
User Agent 
 
Zip Code 
 
State 
 
Income 
 
Age 
 
Gender 
 
Country 
 
Job Title 
 
Job Industry
Log Files 
(~100M page clicks per day) 
User profiles 
NYT articles 
Stream of profiles 
Advertisers 
Segment 
Keywords 
Stock Market 
Stock Market, mortgage, banking, investors, Wall Street, turmoil, New York Stock Exchange 
Health 
diabetes, heart disease, disease, heart, illness 
Green Energy 
Hybrid cars, energy, power, model, carbonated, fuel, bulbs, 
Hybrid cars 
Hybrid cars, vehicles, model, engines, diesel 
Travel 
travel, wine, opening, tickets, hotel, sites, cars, search, restaurant 
… 
… 
Segments 
Trend Detection System 
Stream of clicks 
Trends and 
updated segments 
Campaign 
to sell segments 
$ 
Sales
 
50Gb of uncompressed log files 
 
10Gb of compressed log files 
 
0.5Gb of processed log files 
 
50-100M clicks 
 
4-6M unique users 
 
7000 unique pages with more then 100 hits 
 
Index size 2Gb 
 
Pre-processing & indexing time 
◦ 
~10min on workstation (4 cores & 32Gb) 
◦ 
~1hour on EC2 (2 cores & 16Gb)
 
Alarms Explorer Server implements three real-time scenarios on the alarms stream: 
1. 
Root-Cause-Analysis – finding which device is responsible for occasional “flood” of alarms 
2. 
Short-Term Fault Prediction – predict which device will fail in next 15mins 
3. 
Long-Term Anomaly Detection – detect unusual trends in the network 
 
…system is used in British Telecom 
Alarms Server 
Alarms Explorer Server 
Live feed of data 
Operator 
Big board display 
Telecom Network (~25 000 devices) 
Alarms ~10-100/sec
 
Presented in “Planetary-Scale Views on a Large Instant-Messaging Network” by Jure Leskovec and Eric Horvitz WWW2008
 
Observe social and communication phenomena at a planetary scale 
 
Largest social network analyzed to date 
Research questions: 
How does communication change with user demographics (age, sex, language, country)? 
How does geography affect communication? 
What is the structure of the communication network? 33
 
We collected the data for June 2006 
 
Log size: 
150Gb/day (compressed) 
 
Total: 1 month of communication data: 
4.5Tb of compressed data 
 
Activity over June 2006 (30 days) 
◦ 
245 million users logged in 
◦ 
180 million users engaged in conversations 
◦ 
17,5 million new accounts activated 
◦ 
More than 30 billion conversations 
◦ 
More than 255 billion exchanged messages 34
35
36
 
Count the number of users logging in from particular location on the earth 37
 
Logins from Europe 38
 
6 degrees of separation [Milgram ’60s] 
 
Average distance between two random users is 6.6 
 
90% of nodes can be reached in < 8 hops 
Hops 
Nodes 
1 
10 
2 
78 
3 
396 
4 
8648 
5 
3299252 
6 
28395849 
7 
79059497 
8 
52995778 
9 
10321008 
10 
1955007 
11 
518410 
12 
149945 
13 
44616 
14 
13740 
15 
4476 
16 
1542 
17 
536 
18 
167 
19 
71 
20 
29 
21 
16 
22 
10 
23 
3 
24 
2 
25 
3
Big data tutorial_part4
Big data tutorial_part4

More Related Content

What's hot

Spark Social Media
Spark Social Media Spark Social Media
Spark Social Media
suresh sood
 
Datapreneurs
DatapreneursDatapreneurs
Datapreneurs
suresh sood
 
Big Data
Big DataBig Data
Big Data
Charles Mok
 
Systemof insight
Systemof insightSystemof insight
Systemof insight
suresh sood
 
Geospatial Intelligence Middle East 2013_Big Data_Steven Ramage
Geospatial Intelligence Middle East 2013_Big Data_Steven RamageGeospatial Intelligence Middle East 2013_Big Data_Steven Ramage
Geospatial Intelligence Middle East 2013_Big Data_Steven Ramage
Steven Ramage
 
A brief history of "big data"
A brief history of "big data"A brief history of "big data"
A brief history of "big data"
Nicola Ferraro
 
The Web3 Data Economy: Ocean Protocol
The Web3 Data Economy: Ocean ProtocolThe Web3 Data Economy: Ocean Protocol
The Web3 Data Economy: Ocean Protocol
Trent McConaghy
 
Big data
Big dataBig data
Big data
Nandan Shah
 
Spark
SparkSpark
Big-Data Computing on the Cloud
Big-Data Computing on the CloudBig-Data Computing on the Cloud
Big-Data Computing on the Cloud
Data Driven Innovation
 
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...
Connected Data World
 
Big Data and Classification
Big Data and ClassificationBig Data and Classification
Big Data and Classification
303Computing
 
Graph Database in Graph Intelligence
Graph Database in Graph IntelligenceGraph Database in Graph Intelligence
Graph Database in Graph Intelligence
Chen Zhang
 
Web search-metrics-tutorial-www2010-section-1of7-introduction
Web search-metrics-tutorial-www2010-section-1of7-introductionWeb search-metrics-tutorial-www2010-section-1of7-introduction
Web search-metrics-tutorial-www2010-section-1of7-introduction
Ali Dasdan
 
The future of mobile and big data
The future of mobile and big dataThe future of mobile and big data
The future of mobile and big data
spirit conference
 
Tackling climate change through agricultural supply chain transparency | Javi...
Tackling climate change through agricultural supply chain transparency | Javi...Tackling climate change through agricultural supply chain transparency | Javi...
Tackling climate change through agricultural supply chain transparency | Javi...
Connected Data World
 
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
BigMine
 
Graph Realities
Graph RealitiesGraph Realities
Graph Realities
Connected Data World
 
Software is eating the world data is digesting it DF2013
Software is eating the world data is digesting it DF2013Software is eating the world data is digesting it DF2013
Software is eating the world data is digesting it DF2013
James Governor
 
Neo4j GraphTalks Munich - Einführung
Neo4j GraphTalks Munich - EinführungNeo4j GraphTalks Munich - Einführung
Neo4j GraphTalks Munich - Einführung
Neo4j
 

What's hot (20)

Spark Social Media
Spark Social Media Spark Social Media
Spark Social Media
 
Datapreneurs
DatapreneursDatapreneurs
Datapreneurs
 
Big Data
Big DataBig Data
Big Data
 
Systemof insight
Systemof insightSystemof insight
Systemof insight
 
Geospatial Intelligence Middle East 2013_Big Data_Steven Ramage
Geospatial Intelligence Middle East 2013_Big Data_Steven RamageGeospatial Intelligence Middle East 2013_Big Data_Steven Ramage
Geospatial Intelligence Middle East 2013_Big Data_Steven Ramage
 
A brief history of "big data"
A brief history of "big data"A brief history of "big data"
A brief history of "big data"
 
The Web3 Data Economy: Ocean Protocol
The Web3 Data Economy: Ocean ProtocolThe Web3 Data Economy: Ocean Protocol
The Web3 Data Economy: Ocean Protocol
 
Big data
Big dataBig data
Big data
 
Spark
SparkSpark
Spark
 
Big-Data Computing on the Cloud
Big-Data Computing on the CloudBig-Data Computing on the Cloud
Big-Data Computing on the Cloud
 
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...
Enterprise Data Governance: Leveraging Knowledge Graph & AI in support of a d...
 
Big Data and Classification
Big Data and ClassificationBig Data and Classification
Big Data and Classification
 
Graph Database in Graph Intelligence
Graph Database in Graph IntelligenceGraph Database in Graph Intelligence
Graph Database in Graph Intelligence
 
Web search-metrics-tutorial-www2010-section-1of7-introduction
Web search-metrics-tutorial-www2010-section-1of7-introductionWeb search-metrics-tutorial-www2010-section-1of7-introduction
Web search-metrics-tutorial-www2010-section-1of7-introduction
 
The future of mobile and big data
The future of mobile and big dataThe future of mobile and big data
The future of mobile and big data
 
Tackling climate change through agricultural supply chain transparency | Javi...
Tackling climate change through agricultural supply chain transparency | Javi...Tackling climate change through agricultural supply chain transparency | Javi...
Tackling climate change through agricultural supply chain transparency | Javi...
 
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
Challenging Problems for Scalable Mining of Heterogeneous Social and Informat...
 
Graph Realities
Graph RealitiesGraph Realities
Graph Realities
 
Software is eating the world data is digesting it DF2013
Software is eating the world data is digesting it DF2013Software is eating the world data is digesting it DF2013
Software is eating the world data is digesting it DF2013
 
Neo4j GraphTalks Munich - Einführung
Neo4j GraphTalks Munich - EinführungNeo4j GraphTalks Munich - Einführung
Neo4j GraphTalks Munich - Einführung
 

Viewers also liked

Big data mc_05_2014_long
Big data mc_05_2014_longBig data mc_05_2014_long
Big data mc_05_2014_longAxel Poestges
 
Big Data: Kunden auf der Spur
Big Data: Kunden auf der SpurBig Data: Kunden auf der Spur
Big Data: Kunden auf der Spur
B2B Smartdata GmbH
 
Big Data Tutorial V4
Big Data Tutorial V4Big Data Tutorial V4
Big Data Tutorial V4
Marko Grobelnik
 
Big Data and E-Commerce
Big Data and E-CommerceBig Data and E-Commerce
Big Data and E-Commerce
contentanalyticsinc
 
Big Data Tutorial - Marko Grobelnik - 25 May 2012
Big Data Tutorial - Marko Grobelnik - 25 May 2012Big Data Tutorial - Marko Grobelnik - 25 May 2012
Big Data Tutorial - Marko Grobelnik - 25 May 2012
Marko Grobelnik
 
Big Data in e-Commerce
Big Data in e-CommerceBig Data in e-Commerce
Big Data in e-Commerce
Divante
 
How to Spark Creativity, Drive Innovation, and Ensure Sustainability
How to Spark Creativity, Drive Innovation, and Ensure Sustainability How to Spark Creativity, Drive Innovation, and Ensure Sustainability
How to Spark Creativity, Drive Innovation, and Ensure Sustainability
Faisal Hoque
 

Viewers also liked (8)

Big data mc_05_2014_long
Big data mc_05_2014_longBig data mc_05_2014_long
Big data mc_05_2014_long
 
Big Data: Kunden auf der Spur
Big Data: Kunden auf der SpurBig Data: Kunden auf der Spur
Big Data: Kunden auf der Spur
 
Big Data Tutorial V4
Big Data Tutorial V4Big Data Tutorial V4
Big Data Tutorial V4
 
Big Data - einfach erklärt!
Big Data - einfach erklärt!Big Data - einfach erklärt!
Big Data - einfach erklärt!
 
Big Data and E-Commerce
Big Data and E-CommerceBig Data and E-Commerce
Big Data and E-Commerce
 
Big Data Tutorial - Marko Grobelnik - 25 May 2012
Big Data Tutorial - Marko Grobelnik - 25 May 2012Big Data Tutorial - Marko Grobelnik - 25 May 2012
Big Data Tutorial - Marko Grobelnik - 25 May 2012
 
Big Data in e-Commerce
Big Data in e-CommerceBig Data in e-Commerce
Big Data in e-Commerce
 
How to Spark Creativity, Drive Innovation, and Ensure Sustainability
How to Spark Creativity, Drive Innovation, and Ensure Sustainability How to Spark Creativity, Drive Innovation, and Ensure Sustainability
How to Spark Creativity, Drive Innovation, and Ensure Sustainability
 

Similar to Big data tutorial_part4

Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
Aravindharamanan S
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
GV prasad
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
Pragati Singh
 
Big data tutorial
Big data tutorialBig data tutorial
Big data tutorial
Vishal Sarkar
 
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data TutorialESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
eswcsummerschool
 
EDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko Grobelnik
European Data Forum
 
Big Data - Umesh Bellur
Big Data - Umesh BellurBig Data - Umesh Bellur
Big Data - Umesh Bellur
STS FORUM 2016
 
Big data business case
Big data   business caseBig data   business case
Big data business case
Karthik Padmanabhan ( MLE℠)
 
Big data
Big dataBig data
Big data
raghav125
 
MongoDB Days Silicon Valley: Jumpstart: The Right and Wrong Use Cases for Mon...
MongoDB Days Silicon Valley: Jumpstart: The Right and Wrong Use Cases for Mon...MongoDB Days Silicon Valley: Jumpstart: The Right and Wrong Use Cases for Mon...
MongoDB Days Silicon Valley: Jumpstart: The Right and Wrong Use Cases for Mon...
MongoDB
 
SKILLWISE-BIGDATA ANALYSIS
SKILLWISE-BIGDATA ANALYSISSKILLWISE-BIGDATA ANALYSIS
SKILLWISE-BIGDATA ANALYSIS
Skillwise Consulting
 
Big Data By Vijay Bhaskar Semwal
Big Data By Vijay Bhaskar SemwalBig Data By Vijay Bhaskar Semwal
Big Data By Vijay Bhaskar Semwal
IIIT Allahabad
 
Big data and Internet
Big data and InternetBig data and Internet
Big data and Internet
Sanoj Kumar
 
Big Data et eGovernment
Big Data et eGovernmentBig Data et eGovernment
Big Data et eGovernment
eGov Innovation Center
 
Big Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesBig Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our Lives
Rukshan Batuwita
 
ai based computer basic learning Lecture about Bigdata.ppt
ai based computer basic learning Lecture about Bigdata.pptai based computer basic learning Lecture about Bigdata.ppt
ai based computer basic learning Lecture about Bigdata.ppt
ALAMGIRHOSSAIN256982
 
Big Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-AriBig Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-Ari
Demi Ben-Ari
 
Big Data World
Big Data WorldBig Data World
Big Data World
Hossein Zahed
 
Internet of Things
Internet of ThingsInternet of Things
Internet of Things
Aniekan Akpaffiong
 
Data analytics introduction
Data analytics introductionData analytics introduction
Data analytics introduction
amiyadash
 

Similar to Big data tutorial_part4 (20)

Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
 
Big data tutorial
Big data tutorialBig data tutorial
Big data tutorial
 
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data TutorialESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
 
EDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko Grobelnik
 
Big Data - Umesh Bellur
Big Data - Umesh BellurBig Data - Umesh Bellur
Big Data - Umesh Bellur
 
Big data business case
Big data   business caseBig data   business case
Big data business case
 
Big data
Big dataBig data
Big data
 
MongoDB Days Silicon Valley: Jumpstart: The Right and Wrong Use Cases for Mon...
MongoDB Days Silicon Valley: Jumpstart: The Right and Wrong Use Cases for Mon...MongoDB Days Silicon Valley: Jumpstart: The Right and Wrong Use Cases for Mon...
MongoDB Days Silicon Valley: Jumpstart: The Right and Wrong Use Cases for Mon...
 
SKILLWISE-BIGDATA ANALYSIS
SKILLWISE-BIGDATA ANALYSISSKILLWISE-BIGDATA ANALYSIS
SKILLWISE-BIGDATA ANALYSIS
 
Big Data By Vijay Bhaskar Semwal
Big Data By Vijay Bhaskar SemwalBig Data By Vijay Bhaskar Semwal
Big Data By Vijay Bhaskar Semwal
 
Big data and Internet
Big data and InternetBig data and Internet
Big data and Internet
 
Big Data et eGovernment
Big Data et eGovernmentBig Data et eGovernment
Big Data et eGovernment
 
Big Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our LivesBig Data and Data Science: The Technologies Shaping Our Lives
Big Data and Data Science: The Technologies Shaping Our Lives
 
ai based computer basic learning Lecture about Bigdata.ppt
ai based computer basic learning Lecture about Bigdata.pptai based computer basic learning Lecture about Bigdata.ppt
ai based computer basic learning Lecture about Bigdata.ppt
 
Big Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-AriBig Data made easy in the era of the Cloud - Demi Ben-Ari
Big Data made easy in the era of the Cloud - Demi Ben-Ari
 
Big Data World
Big Data WorldBig Data World
Big Data World
 
Internet of Things
Internet of ThingsInternet of Things
Internet of Things
 
Data analytics introduction
Data analytics introductionData analytics introduction
Data analytics introduction
 

Recently uploaded

一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理
zsafxbf
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
dataschool1
 
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
osoyvvf
 
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdfreading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
perranet1
 
社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .
NABLAS株式会社
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
aguty
 
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
nhutnguyen355078
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
uevausa
 
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
asyed10
 
06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus
Timothy Spann
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
Vietnam Cotton & Spinning Association
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
nyvan3
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
TeukuEriSyahputra
 
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
agdhot
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
Vineet
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
Márton Kodok
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
Vietnam Cotton & Spinning Association
 
Econ3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdfEcon3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdf
blueshagoo1
 
Senior Software Profiles Backend Sample - Sheet1.pdf
Senior Software Profiles  Backend Sample - Sheet1.pdfSenior Software Profiles  Backend Sample - Sheet1.pdf
Senior Software Profiles Backend Sample - Sheet1.pdf
Vineet
 
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
ywqeos
 

Recently uploaded (20)

一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
 
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
一比一原版(uom毕业证书)曼彻斯特大学毕业证如何办理
 
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdfreading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
 
社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .社内勉強会資料_Hallucination of LLMs               .
社内勉強会資料_Hallucination of LLMs               .
 
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
一比一原版澳洲西澳大学毕业证(uwa毕业证书)如何办理
 
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
 
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
一比一原版美国帕森斯设计学院毕业证(parsons毕业证书)如何办理
 
06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus06-18-2024-Princeton Meetup-Introduction to Milvus
06-18-2024-Princeton Meetup-Introduction to Milvus
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
 
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
一比一原版英国赫特福德大学毕业证(hertfordshire毕业证书)如何办理
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
 
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
 
Build applications with generative AI on Google Cloud
Build applications with generative AI on Google CloudBuild applications with generative AI on Google Cloud
Build applications with generative AI on Google Cloud
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
 
Econ3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdfEcon3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdf
 
Senior Software Profiles Backend Sample - Sheet1.pdf
Senior Software Profiles  Backend Sample - Sheet1.pdfSenior Software Profiles  Backend Sample - Sheet1.pdf
Senior Software Profiles Backend Sample - Sheet1.pdf
 
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
 

Big data tutorial_part4

  • 1. Marko Grobelnik marko.grobelnik@ijs.si Jozef Stefan Institute Ljubljana, Slovenia Stavanger, May 8th 2012
  • 2.  Introduction ◦ What is Big data? ◦ Why Big-Data? ◦ When Big-Data is really a problem?  Techniques  Tools  Applications  Literature
  • 3.
  • 4.
  • 5.  ‘Big-data’ is similar to ‘Small-data’, but bigger  …but having data bigger consequently requires different approaches: ◦ techniques, tools & architectures  …to solve: ◦ New problems… ◦ …and old problems in a better way.
  • 6. From “Understanding Big Data” by IBM
  • 7.
  • 9.  Key enablers for the growth of “Big Data” are: ◦ Increase of storage capacities ◦ Increase of processing power ◦ Availability of data
  • 10.
  • 11.
  • 12.
  • 13.
  • 14.
  • 15.
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.  NoSQL ◦ DatabasesMongoDB, CouchDB, Cassandra, Redis, BigTable, Hbase, Hypertable, Voldemort, Riak, ZooKeeper  MapReduce ◦ Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum  Storage ◦ S3, Hadoop Distributed File System  Servers ◦ EC2, Google App Engine, Elastic, Beanstalk, Heroku  Processing ◦ R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, Tinkerpop
  • 21.
  • 22.  …when the operations on data are complex: ◦ …e.g. simple counting is not a complex problem ◦ Modeling and reasoning with data of different kinds can get extremely complex  Good news about big-data: ◦ Often, because of vast amount of data, modeling techniques can get simpler (e.g. smart counting can replace complex model based analytics)… ◦ …as long as we deal with the scale
  • 23.  Research areas (such as IR, KDD, ML, NLP, SemWeb, …) are sub- cubes within the data cube Scalability Dynamicity Context Quality Usage
  • 24.
  • 25.
  • 26.  Good recommendations can make a big difference when keeping a user on a web site ◦ …the key is how rich context model a system is using to select information for a user ◦ Bad recommendations <1% users, good ones >5% users click Contextual personalized recommendations generated in ~20ms
  • 27.  Domain  Sub-domain  Page URL  URL sub-directories  Page Meta Tags  Page Title  Page Content  Named Entities  Has Query  Referrer Query  Referring Domain  Referring URL  Outgoing URL  GeoIP Country  GeoIP State  GeoIP City  Absolute Date  Day of the Week  Day period  Hour of the day  User Agent  Zip Code  State  Income  Age  Gender  Country  Job Title  Job Industry
  • 28. Log Files (~100M page clicks per day) User profiles NYT articles Stream of profiles Advertisers Segment Keywords Stock Market Stock Market, mortgage, banking, investors, Wall Street, turmoil, New York Stock Exchange Health diabetes, heart disease, disease, heart, illness Green Energy Hybrid cars, energy, power, model, carbonated, fuel, bulbs, Hybrid cars Hybrid cars, vehicles, model, engines, diesel Travel travel, wine, opening, tickets, hotel, sites, cars, search, restaurant … … Segments Trend Detection System Stream of clicks Trends and updated segments Campaign to sell segments $ Sales
  • 29.  50Gb of uncompressed log files  10Gb of compressed log files  0.5Gb of processed log files  50-100M clicks  4-6M unique users  7000 unique pages with more then 100 hits  Index size 2Gb  Pre-processing & indexing time ◦ ~10min on workstation (4 cores & 32Gb) ◦ ~1hour on EC2 (2 cores & 16Gb)
  • 30.
  • 31.  Alarms Explorer Server implements three real-time scenarios on the alarms stream: 1. Root-Cause-Analysis – finding which device is responsible for occasional “flood” of alarms 2. Short-Term Fault Prediction – predict which device will fail in next 15mins 3. Long-Term Anomaly Detection – detect unusual trends in the network  …system is used in British Telecom Alarms Server Alarms Explorer Server Live feed of data Operator Big board display Telecom Network (~25 000 devices) Alarms ~10-100/sec
  • 32.  Presented in “Planetary-Scale Views on a Large Instant-Messaging Network” by Jure Leskovec and Eric Horvitz WWW2008
  • 33.  Observe social and communication phenomena at a planetary scale  Largest social network analyzed to date Research questions: How does communication change with user demographics (age, sex, language, country)? How does geography affect communication? What is the structure of the communication network? 33
  • 34.  We collected the data for June 2006  Log size: 150Gb/day (compressed)  Total: 1 month of communication data: 4.5Tb of compressed data  Activity over June 2006 (30 days) ◦ 245 million users logged in ◦ 180 million users engaged in conversations ◦ 17,5 million new accounts activated ◦ More than 30 billion conversations ◦ More than 255 billion exchanged messages 34
  • 35. 35
  • 36. 36
  • 37.  Count the number of users logging in from particular location on the earth 37
  • 38.  Logins from Europe 38
  • 39.  6 degrees of separation [Milgram ’60s]  Average distance between two random users is 6.6  90% of nodes can be reached in < 8 hops Hops Nodes 1 10 2 78 3 396 4 8648 5 3299252 6 28395849 7 79059497 8 52995778 9 10321008 10 1955007 11 518410 12 149945 13 44616 14 13740 15 4476 16 1542 17 536 18 167 19 71 20 29 21 16 22 10 23 3 24 2 25 3