SlideShare a Scribd company logo
1 of 58
Download to read offline
Marko Grobelnik
marko.grobelnik@ijs.si
Jozef Stefan Institute

Kalamaki, May 25th 2012


Introduction
◦ What is Big data?
◦ Why Big-Data?
◦ When Big-Data is really a problem?






Techniques
Tools
Applications
Literature




‘Big-data’ is similar to ‘Small-data’, but bigger
…but having data bigger consequently requires
different approaches:
◦ techniques, tools, architectures



…with an aim to solve new problems
◦ …and old problems in a better way.
From “Understanding Big Data” by IBM
Big-Data


Key enablers for the growth of “Big Data” are:
◦ Increase of storage capacities
◦ Increase of processing power
◦ Availability of data


Where processing is hosted?

◦ Distributed Servers / Cloud (e.g. Amazon EC2)



Where data is stored?

◦ Distributed Storage (e.g. Amazon S3)



What is the programming model?

◦ Distributed Processing (e.g. MapReduce)



How data is stored & indexed?

◦ High-performance schema-free databases (e.g.
MongoDB)



What operations are performed on data?

◦ Analytic / Semantic Processing (e.g. R, OWLIM)


Computing and storage are typically hosted
transparently on cloud infrastructures
◦ …providing scale, flexibility and high fail-safety



Distributed Servers
◦ Amazon-EC2, Google App Engine, Elastic,
Beanstalk, Heroku



Distributed Storage
◦ Amazon-S3, Hadoop Distributed File System


Distributed processing of Big-Data requires nonstandard programming models
◦ …beyond single machines or traditional parallel
programming models (like MPI)
◦ …the aim is to simplify complex programming tasks





The most popular programming model is
MapReduce approach
Implementations of MapReduce

◦ Hadoop (http://hadoop.apache.org/), Hive, Pig,
Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu,
Flume, Kafka, Azkaban, Oozie, Greenplum


The key idea of the MapReduce approach:
◦ A target problem needs to be parallelizable

◦ First, the problem gets split into a set of smaller problems (Map step)
◦ Next, smaller problems are solved in a parallel way
◦ Finally, a set of solutions to the smaller problems get synthesized
into a solution of the original problem (Reduce step)


NoSQL class of databases have in common:
◦
◦
◦
◦
◦



To support large amounts of data
Have mostly non-SQL interface
Operate on distributed infrastructures (e.g. Hadoop)
Are based on key-value pairs (no predefined schema)
…are flexible and fast

Implementations
◦ MongoDB, CouchDB, Cassandra, Redis, BigTable, Hbase,
Hypertable, Voldemort, Riak, ZooKeeper…


…when the operations on data are complex:
◦ …e.g. simple counting is not a complex problem
◦ Modeling and reasoning with data of different kinds
can get extremely complex



Good news about big-data:
◦ Often, because of vast amount of data, modeling
techniques can get simpler (e.g. smart counting can
replace complex model-based analytics)…
◦ …as long as we deal with the scale


Research areas (such
as IR, KDD, ML, NLP,
SemWeb, …) are subcubes within the data
cube

Usage
Quality

Context
Streaming
Scalability




A risk with “Big-Data mining” is that an
analyst can “discover” patterns that are
meaningless
Statisticians call it Bonferroni’s principle:
◦ Roughly, if you look in more places for interesting
patterns than your amount of data will support, you
are bound to find crap

Example taken from: Rajaraman, Ullman: Mining of Massive Datasets
Example:
 We want to find (unrelated) people who at least twice
have stayed at the same hotel on the same day
◦
◦
◦
◦
◦


109 people being tracked.
1000 days.
Each person stays in a hotel 1% of the time (1 day out of 100)
Hotels hold 100 people (so 105 hotels).
If everyone behaves randomly (i.e., no terrorists) will the data
mining detect anything suspicious?

Expected number of “suspicious” pairs of people:

◦ 250,000
◦ … too many combinations to check – we need to have some
additional evidence to find “suspicious” pairs of people in
some more efficient way
Example taken from: Rajaraman, Ullman: Mining of Massive Datasets


Smart sampling of data
◦ …reducing the original data while not losing the
statistical properties of data



Finding similar items
◦ …efficient multidimensional indexing



Incremental updating of the models
◦ (vs. building models from scratch)
◦ …crucial for streaming data



Distributed linear algebra
◦ …dealing with large sparse matrices


On the top of the previous ops we perform
usual data mining/machine learning/statistics
operators:
◦ Supervised learning (classification, regression, …)
◦ Non-supervised learning (clustering, different types
of decompositions, …)
◦ …



…we are just more careful which algorithms
we choose (typically linear or sub-linear
versions)


An excellent overview of the algorithms
covering the above issues is the book
“Rajaraman, Ullman: Mining of Massive
Datasets”


Good recommendations
can make a big
difference when keeping
a user on a web site
◦ …the key is how rich the
context model a system is
using to select information
for a user
◦ Bad recommendations <1%
users, good ones >5% users
click
◦ 200clicks/sec
Contextual
personalized
recommendations
generated in ~20ms






Domain
Sub-domain
Page URL
URL sub-directories










Page Meta Tags
Page Title
Page Content
Named Entities






Has Query
Referrer Query




GeoIP Country
GeoIP State
GeoIP City



















Referring Domain
Referring URL
Outgoing URL





Absolute Date
Day of the Week
Day period
Hour of the day
User Agent

Zip Code
State
Income
Age
Gender
Country
Job Title
Job Industry
Trend Detection System
Log Files
(~100M
page clicks
per day)

Stream
of clicks

User
profiles

Stream of
profiles

Trends and
updated segments
NYT
articles

Segment

Keywords

Stock
Market

diabetes, heart disease, disease, heart,
illness

Green
Energy

Hybrid cars, energy, power, model,
carbonated, fuel, bulbs,

Hybrid cars

Hybrid cars, vehicles, model, engines,
diesel

Travel

travel, wine, opening, tickets, hotel,
sites, cars, search, restaurant

…

…

Segments

Stock Market, mortgage, banking,
investors, Wall Street, turmoil, New
York Stock Exchange

Health

Sales

$

Campaign
to sell
segments

Advertisers





50Gb of uncompressed log files
50-100M clicks
4-6M unique users
7000 unique pages with more then 100 hits
Alarms Server

Telecom
Network
(~25 000
devices)



Alarms

~10-100/sec

Live feed of data

Alarms
Explorer
Server

Alarms Explorer Server implements three
real-time scenarios on the alarms stream:
1. Root-Cause-Analysis – finding which device is
responsible for occasional “flood” of alarms
2. Short-Term Fault Prediction – predict which
device will fail in next 15mins
3. Long-Term Anomaly Detection – detect
unusual trends in the network



…system is used in British Telecom

Operator

Big board display


The aim of the project is to collect and analyze
most of the main-stream media across the world
◦ …from 35,000 publishers (180K RSS feeds) crawled in
real time (few ~10articles per second)
◦ …each article document gets extracted, cleaned,
semantically annotated, structure extracted



Challenges are in terms of complexity of
processing and querying of the extracted data
◦ …and matching textual information across the
languages (cross-lingual technologies)

http://render-project.eu/

http://www.xlike.org/
http://newsfeed.ijs.si/
Plain text

Extracted
graph
of triples
from text

Text
Enrichment

“Enrycher” is available as
as a web-service generating
Semantic Graph, LOD links,
Entities, Keywords, Categories,
Text Summarization


The aim is to use analytic techniques to
visualize documents in different ways:
◦ Topic view
◦ Social view
◦ Temporal view
Query
Search
Results
Topic Map
Selected
group of news

Selected
story
Query

Named
entities
in relation
US Elections
Query

US Budget

Result set
Topic Trends
Visualization

NATO-Russia
Mid-East
conflict

Topics
description
Dec 7th 1941
Apr 6th 1941
June 1944
Query
Conceptual map
Search Point
Dynamic
contextual
ranking based
on the search
point




Observe social and communication
phenomena at a planetary scale
Largest social network analyzed till 2010

Research questions:
 How does communication change with user
demographics (age, sex, language, country)?
 How does geography affect communication?
 What is the structure of the communication
network?
“Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008

50








We collected the data for June 2006
Log size:
150Gb/day (compressed)
Total: 1 month of communication data:
4.5Tb of compressed data
Activity over June 2006 (30 days)
◦
◦
◦
◦
◦

245 million users logged in
180 million users engaged in conversations
17,5 million new accounts activated
More than 30 billion conversations
More than 255 billion exchanged messages

“Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008

51
“Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008

52
“Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008

53


Count the number of users logging in from
particular location on the earth

“Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008

54


Logins from Europe

“Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008

55
Hops
1

396

4

8648

5

3299252

6

28395849

7

79059497

8

52995778

9

10321008

10

1955007

11

518410

12

149945

13

44616

14

13740

15

4476

16

1542

17

536

18

167

19



78

3



10

2



Nodes

71

20
6 degrees of separation [Milgram ’60s]
21
Average distance between two random users is 6.622
23
90% of nodes can be reached in < 8 hops

29

16
10
3

24

“Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008

2

25

3


Big-Data is everywhere, we are just not used to
deal with it



The “Big-Data” hype is very recent



Can we do “Big-Data” without big investment?

◦ …growth seems to be going up
◦ …evident lack of experts to build Big-Data apps

◦ …yes – many open source tools, computing machinery is
cheap (to buy or to rent)
◦ …the key is knowledge on how to deal with data
◦ …data is either free (e.g. Wikipedia) or to buy (e.g.
twitter)

More Related Content

What's hot

Standardizing for Open Data
Standardizing for Open DataStandardizing for Open Data
Standardizing for Open DataIvan Herman
 
Linked data for Enterprise Data Integration
Linked data for Enterprise Data IntegrationLinked data for Enterprise Data Integration
Linked data for Enterprise Data IntegrationSören Auer
 
SSSW2015 Data Workflow Tutorial
SSSW2015 Data Workflow TutorialSSSW2015 Data Workflow Tutorial
SSSW2015 Data Workflow TutorialSSSW
 
Introduction to linked data
Introduction to linked dataIntroduction to linked data
Introduction to linked dataLaura Po
 
Session 1.2 improving access to digital content by semantic enrichment
Session 1.2   improving access to digital content by semantic enrichmentSession 1.2   improving access to digital content by semantic enrichment
Session 1.2 improving access to digital content by semantic enrichmentsemanticsconference
 
FAIR data: LOUD for all audiences
FAIR data: LOUD for all audiencesFAIR data: LOUD for all audiences
FAIR data: LOUD for all audiencesAlessandro Adamou
 
Memory Connected
Memory ConnectedMemory Connected
Memory ConnectedLi Ding
 
Linked Data and Knowledge Graphs -- Constructing and Understanding Knowledge ...
Linked Data and Knowledge Graphs -- Constructing and Understanding Knowledge ...Linked Data and Knowledge Graphs -- Constructing and Understanding Knowledge ...
Linked Data and Knowledge Graphs -- Constructing and Understanding Knowledge ...Jeff Z. Pan
 
Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph IntroductionSören Auer
 
Linked Open Data and DANS
Linked Open Data and DANSLinked Open Data and DANS
Linked Open Data and DANSvty
 
Knowledge discoverylaurahollink
Knowledge discoverylaurahollinkKnowledge discoverylaurahollink
Knowledge discoverylaurahollinkSSSW
 
Linked Open Government Data: What’s Next?
Linked Open Government Data:  What’s Next?Linked Open Government Data:  What’s Next?
Linked Open Government Data: What’s Next?Li Ding
 
Multi-model Databases and Tightly Integrated Polystores
Multi-model Databases and Tightly Integrated PolystoresMulti-model Databases and Tightly Integrated Polystores
Multi-model Databases and Tightly Integrated PolystoresJiaheng Lu
 
Semantics for Big Data Integration and Analysis
Semantics for Big Data Integration and AnalysisSemantics for Big Data Integration and Analysis
Semantics for Big Data Integration and AnalysisCraig Knoblock
 
A distributed network of digital heritage information - Semantics Amsterdam
A distributed network of digital heritage information - Semantics AmsterdamA distributed network of digital heritage information - Semantics Amsterdam
A distributed network of digital heritage information - Semantics AmsterdamEnno Meijers
 
Towards digitizing scholarly communication
Towards digitizing scholarly communicationTowards digitizing scholarly communication
Towards digitizing scholarly communicationSören Auer
 

What's hot (20)

Standardizing for Open Data
Standardizing for Open DataStandardizing for Open Data
Standardizing for Open Data
 
Linked data for Enterprise Data Integration
Linked data for Enterprise Data IntegrationLinked data for Enterprise Data Integration
Linked data for Enterprise Data Integration
 
SSSW2015 Data Workflow Tutorial
SSSW2015 Data Workflow TutorialSSSW2015 Data Workflow Tutorial
SSSW2015 Data Workflow Tutorial
 
Ziegler Open Data in Special Collections Libraries
Ziegler Open Data in Special Collections LibrariesZiegler Open Data in Special Collections Libraries
Ziegler Open Data in Special Collections Libraries
 
Introduction to linked data
Introduction to linked dataIntroduction to linked data
Introduction to linked data
 
Linked Data to Improve the OER Experience
Linked Data to Improve the OER ExperienceLinked Data to Improve the OER Experience
Linked Data to Improve the OER Experience
 
Session 1.2 improving access to digital content by semantic enrichment
Session 1.2   improving access to digital content by semantic enrichmentSession 1.2   improving access to digital content by semantic enrichment
Session 1.2 improving access to digital content by semantic enrichment
 
FAIR data: LOUD for all audiences
FAIR data: LOUD for all audiencesFAIR data: LOUD for all audiences
FAIR data: LOUD for all audiences
 
Full Erdmann Ruttenberg Community Approaches to Open Data at Scale
Full Erdmann Ruttenberg Community Approaches to Open Data at ScaleFull Erdmann Ruttenberg Community Approaches to Open Data at Scale
Full Erdmann Ruttenberg Community Approaches to Open Data at Scale
 
Memory Connected
Memory ConnectedMemory Connected
Memory Connected
 
Linked Data and Knowledge Graphs -- Constructing and Understanding Knowledge ...
Linked Data and Knowledge Graphs -- Constructing and Understanding Knowledge ...Linked Data and Knowledge Graphs -- Constructing and Understanding Knowledge ...
Linked Data and Knowledge Graphs -- Constructing and Understanding Knowledge ...
 
Linked Data
Linked DataLinked Data
Linked Data
 
Knowledge Graph Introduction
Knowledge Graph IntroductionKnowledge Graph Introduction
Knowledge Graph Introduction
 
Linked Open Data and DANS
Linked Open Data and DANSLinked Open Data and DANS
Linked Open Data and DANS
 
Knowledge discoverylaurahollink
Knowledge discoverylaurahollinkKnowledge discoverylaurahollink
Knowledge discoverylaurahollink
 
Linked Open Government Data: What’s Next?
Linked Open Government Data:  What’s Next?Linked Open Government Data:  What’s Next?
Linked Open Government Data: What’s Next?
 
Multi-model Databases and Tightly Integrated Polystores
Multi-model Databases and Tightly Integrated PolystoresMulti-model Databases and Tightly Integrated Polystores
Multi-model Databases and Tightly Integrated Polystores
 
Semantics for Big Data Integration and Analysis
Semantics for Big Data Integration and AnalysisSemantics for Big Data Integration and Analysis
Semantics for Big Data Integration and Analysis
 
A distributed network of digital heritage information - Semantics Amsterdam
A distributed network of digital heritage information - Semantics AmsterdamA distributed network of digital heritage information - Semantics Amsterdam
A distributed network of digital heritage information - Semantics Amsterdam
 
Towards digitizing scholarly communication
Towards digitizing scholarly communicationTowards digitizing scholarly communication
Towards digitizing scholarly communication
 

Viewers also liked

Programming with LOD
Programming with LODProgramming with LOD
Programming with LODFumihiro Kato
 
Semantics, Sensors, and the Social Web
Semantics, Sensors, and the Social WebSemantics, Sensors, and the Social Web
Semantics, Sensors, and the Social WebThe Open University
 
DAA Keynote- 99 Amazing Big Data Experiences
DAA Keynote- 99 Amazing Big Data ExperiencesDAA Keynote- 99 Amazing Big Data Experiences
DAA Keynote- 99 Amazing Big Data ExperiencesOxford Tech + UX
 
Mapping Lo Dto Proton Revised [Compatibility Mode]
Mapping Lo Dto Proton Revised [Compatibility Mode]Mapping Lo Dto Proton Revised [Compatibility Mode]
Mapping Lo Dto Proton Revised [Compatibility Mode]Mariana Damova, Ph.D
 
ESWC SS 2012 - Monday Tutorial 2 Barry Norton: Introduction to SPARQL
ESWC SS 2012 - Monday Tutorial 2 Barry Norton: Introduction to SPARQLESWC SS 2012 - Monday Tutorial 2 Barry Norton: Introduction to SPARQL
ESWC SS 2012 - Monday Tutorial 2 Barry Norton: Introduction to SPARQLeswcsummerschool
 

Viewers also liked (6)

Programming with LOD
Programming with LODProgramming with LOD
Programming with LOD
 
Semantics, Sensors, and the Social Web
Semantics, Sensors, and the Social WebSemantics, Sensors, and the Social Web
Semantics, Sensors, and the Social Web
 
IndustryInform Service of Mozaika
IndustryInform Service of MozaikaIndustryInform Service of Mozaika
IndustryInform Service of Mozaika
 
DAA Keynote- 99 Amazing Big Data Experiences
DAA Keynote- 99 Amazing Big Data ExperiencesDAA Keynote- 99 Amazing Big Data Experiences
DAA Keynote- 99 Amazing Big Data Experiences
 
Mapping Lo Dto Proton Revised [Compatibility Mode]
Mapping Lo Dto Proton Revised [Compatibility Mode]Mapping Lo Dto Proton Revised [Compatibility Mode]
Mapping Lo Dto Proton Revised [Compatibility Mode]
 
ESWC SS 2012 - Monday Tutorial 2 Barry Norton: Introduction to SPARQL
ESWC SS 2012 - Monday Tutorial 2 Barry Norton: Introduction to SPARQLESWC SS 2012 - Monday Tutorial 2 Barry Norton: Introduction to SPARQL
ESWC SS 2012 - Monday Tutorial 2 Barry Norton: Introduction to SPARQL
 

Similar to ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial

EDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEuropean Data Forum
 
Big Data Tutorial - Marko Grobelnik - 25 May 2012
Big Data Tutorial - Marko Grobelnik - 25 May 2012Big Data Tutorial - Marko Grobelnik - 25 May 2012
Big Data Tutorial - Marko Grobelnik - 25 May 2012Marko Grobelnik
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4heyramzz
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4GV prasad
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4Pragati Singh
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introductionbutest
 
Delivering Security Insights with Data Analytics and Visualization
Delivering Security Insights with Data Analytics and VisualizationDelivering Security Insights with Data Analytics and Visualization
Delivering Security Insights with Data Analytics and VisualizationRaffael Marty
 
01Introduction to data mining chapter 1.ppt
01Introduction to data mining chapter 1.ppt01Introduction to data mining chapter 1.ppt
01Introduction to data mining chapter 1.pptadmsoyadm4
 
Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1DanWooster1
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introductionhktripathy
 

Similar to ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial (20)

EDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko Grobelnik
 
Big Data Tutorial - Marko Grobelnik - 25 May 2012
Big Data Tutorial - Marko Grobelnik - 25 May 2012Big Data Tutorial - Marko Grobelnik - 25 May 2012
Big Data Tutorial - Marko Grobelnik - 25 May 2012
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
 
Big Data Tutorial V4
Big Data Tutorial V4Big Data Tutorial V4
Big Data Tutorial V4
 
Big Data Analytics V2
Big Data Analytics V2Big Data Analytics V2
Big Data Analytics V2
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
 
Big data tutorial
Big data tutorialBig data tutorial
Big data tutorial
 
Chapter 1. Introduction
Chapter 1. IntroductionChapter 1. Introduction
Chapter 1. Introduction
 
Chapter 1. Introduction.ppt
Chapter 1. Introduction.pptChapter 1. Introduction.ppt
Chapter 1. Introduction.ppt
 
Introduction to data warehouse
Introduction to data warehouseIntroduction to data warehouse
Introduction to data warehouse
 
Delivering Security Insights with Data Analytics and Visualization
Delivering Security Insights with Data Analytics and VisualizationDelivering Security Insights with Data Analytics and Visualization
Delivering Security Insights with Data Analytics and Visualization
 
Data Mining Intro
Data Mining IntroData Mining Intro
Data Mining Intro
 
01Intro.ppt
01Intro.ppt01Intro.ppt
01Intro.ppt
 
01Introduction to data mining chapter 1.ppt
01Introduction to data mining chapter 1.ppt01Introduction to data mining chapter 1.ppt
01Introduction to data mining chapter 1.ppt
 
01Intro.ppt
01Intro.ppt01Intro.ppt
01Intro.ppt
 
data mining
data miningdata mining
data mining
 
Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1Upstate CSCI 525 Data Mining Chapter 1
Upstate CSCI 525 Data Mining Chapter 1
 
Lect 1 introduction
Lect 1 introductionLect 1 introduction
Lect 1 introduction
 

More from eswcsummerschool

Semantic Aquarium - ESWC SSchool 14 - Student project
Semantic Aquarium - ESWC SSchool 14 - Student projectSemantic Aquarium - ESWC SSchool 14 - Student project
Semantic Aquarium - ESWC SSchool 14 - Student projecteswcsummerschool
 
Syrtaki - ESWC SSchool 14 - Student project
Syrtaki  - ESWC SSchool 14 - Student projectSyrtaki  - ESWC SSchool 14 - Student project
Syrtaki - ESWC SSchool 14 - Student projecteswcsummerschool
 
Keep fit (a bit) - ESWC SSchool 14 - Student project
Keep fit (a bit)  - ESWC SSchool 14 - Student projectKeep fit (a bit)  - ESWC SSchool 14 - Student project
Keep fit (a bit) - ESWC SSchool 14 - Student projecteswcsummerschool
 
Arabic Sentiment Lexicon - ESWC SSchool 14 - Student project
Arabic Sentiment Lexicon - ESWC SSchool 14 - Student projectArabic Sentiment Lexicon - ESWC SSchool 14 - Student project
Arabic Sentiment Lexicon - ESWC SSchool 14 - Student projecteswcsummerschool
 
FIT-8BIT An activity music assistant - ESWC SSchool 14 - Student project
FIT-8BIT An activity music assistant - ESWC SSchool 14 - Student projectFIT-8BIT An activity music assistant - ESWC SSchool 14 - Student project
FIT-8BIT An activity music assistant - ESWC SSchool 14 - Student projecteswcsummerschool
 
Personal Tours at the British Museum - ESWC SSchool 14 - Student project
Personal Tours at the British Museum  - ESWC SSchool 14 - Student projectPersonal Tours at the British Museum  - ESWC SSchool 14 - Student project
Personal Tours at the British Museum - ESWC SSchool 14 - Student projecteswcsummerschool
 
Exhibition recommendation using British Museum data and Event Registry - ESWC...
Exhibition recommendation using British Museum data and Event Registry - ESWC...Exhibition recommendation using British Museum data and Event Registry - ESWC...
Exhibition recommendation using British Museum data and Event Registry - ESWC...eswcsummerschool
 
Empowering fishing business using Linked Data - ESWC SSchool 14 - Student pro...
Empowering fishing business using Linked Data - ESWC SSchool 14 - Student pro...Empowering fishing business using Linked Data - ESWC SSchool 14 - Student pro...
Empowering fishing business using Linked Data - ESWC SSchool 14 - Student pro...eswcsummerschool
 
Tutorial: Social Semantic Web and Crowdsourcing - E. Simperl - ESWC SS 2014
Tutorial: Social Semantic Web and Crowdsourcing - E. Simperl - ESWC SS 2014 Tutorial: Social Semantic Web and Crowdsourcing - E. Simperl - ESWC SS 2014
Tutorial: Social Semantic Web and Crowdsourcing - E. Simperl - ESWC SS 2014 eswcsummerschool
 
Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014
Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014
Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014eswcsummerschool
 
Hands On: Amazon Mechanical Turk - M. Acosta - ESWC SS 2014
Hands On: Amazon Mechanical Turk - M. Acosta - ESWC SS 2014 Hands On: Amazon Mechanical Turk - M. Acosta - ESWC SS 2014
Hands On: Amazon Mechanical Turk - M. Acosta - ESWC SS 2014 eswcsummerschool
 
Tutorial: Querying a Marine Data Warehouse Using SPARQL - I. Fundulaki - ESWC...
Tutorial: Querying a Marine Data Warehouse Using SPARQL - I. Fundulaki - ESWC...Tutorial: Querying a Marine Data Warehouse Using SPARQL - I. Fundulaki - ESWC...
Tutorial: Querying a Marine Data Warehouse Using SPARQL - I. Fundulaki - ESWC...eswcsummerschool
 
Mon norton tut_publishing01
Mon norton tut_publishing01Mon norton tut_publishing01
Mon norton tut_publishing01eswcsummerschool
 
Mon domingue introduction to the school
Mon domingue introduction to the schoolMon domingue introduction to the school
Mon domingue introduction to the schooleswcsummerschool
 
Mon norton tut_querying cultural heritage data
Mon norton tut_querying cultural heritage dataMon norton tut_querying cultural heritage data
Mon norton tut_querying cultural heritage dataeswcsummerschool
 
Tue acosta hands_on_providinglinkeddata
Tue acosta hands_on_providinglinkeddataTue acosta hands_on_providinglinkeddata
Tue acosta hands_on_providinglinkeddataeswcsummerschool
 
Thu bernstein key_warp_speed
Thu bernstein key_warp_speedThu bernstein key_warp_speed
Thu bernstein key_warp_speedeswcsummerschool
 
Fri schreiber key_knowledge engineering
Fri schreiber key_knowledge engineeringFri schreiber key_knowledge engineering
Fri schreiber key_knowledge engineeringeswcsummerschool
 
Mon norton tut_queryinglinkeddata02
Mon norton tut_queryinglinkeddata02Mon norton tut_queryinglinkeddata02
Mon norton tut_queryinglinkeddata02eswcsummerschool
 
Mon fundulaki tut_querying linked data
Mon fundulaki tut_querying linked dataMon fundulaki tut_querying linked data
Mon fundulaki tut_querying linked dataeswcsummerschool
 

More from eswcsummerschool (20)

Semantic Aquarium - ESWC SSchool 14 - Student project
Semantic Aquarium - ESWC SSchool 14 - Student projectSemantic Aquarium - ESWC SSchool 14 - Student project
Semantic Aquarium - ESWC SSchool 14 - Student project
 
Syrtaki - ESWC SSchool 14 - Student project
Syrtaki  - ESWC SSchool 14 - Student projectSyrtaki  - ESWC SSchool 14 - Student project
Syrtaki - ESWC SSchool 14 - Student project
 
Keep fit (a bit) - ESWC SSchool 14 - Student project
Keep fit (a bit)  - ESWC SSchool 14 - Student projectKeep fit (a bit)  - ESWC SSchool 14 - Student project
Keep fit (a bit) - ESWC SSchool 14 - Student project
 
Arabic Sentiment Lexicon - ESWC SSchool 14 - Student project
Arabic Sentiment Lexicon - ESWC SSchool 14 - Student projectArabic Sentiment Lexicon - ESWC SSchool 14 - Student project
Arabic Sentiment Lexicon - ESWC SSchool 14 - Student project
 
FIT-8BIT An activity music assistant - ESWC SSchool 14 - Student project
FIT-8BIT An activity music assistant - ESWC SSchool 14 - Student projectFIT-8BIT An activity music assistant - ESWC SSchool 14 - Student project
FIT-8BIT An activity music assistant - ESWC SSchool 14 - Student project
 
Personal Tours at the British Museum - ESWC SSchool 14 - Student project
Personal Tours at the British Museum  - ESWC SSchool 14 - Student projectPersonal Tours at the British Museum  - ESWC SSchool 14 - Student project
Personal Tours at the British Museum - ESWC SSchool 14 - Student project
 
Exhibition recommendation using British Museum data and Event Registry - ESWC...
Exhibition recommendation using British Museum data and Event Registry - ESWC...Exhibition recommendation using British Museum data and Event Registry - ESWC...
Exhibition recommendation using British Museum data and Event Registry - ESWC...
 
Empowering fishing business using Linked Data - ESWC SSchool 14 - Student pro...
Empowering fishing business using Linked Data - ESWC SSchool 14 - Student pro...Empowering fishing business using Linked Data - ESWC SSchool 14 - Student pro...
Empowering fishing business using Linked Data - ESWC SSchool 14 - Student pro...
 
Tutorial: Social Semantic Web and Crowdsourcing - E. Simperl - ESWC SS 2014
Tutorial: Social Semantic Web and Crowdsourcing - E. Simperl - ESWC SS 2014 Tutorial: Social Semantic Web and Crowdsourcing - E. Simperl - ESWC SS 2014
Tutorial: Social Semantic Web and Crowdsourcing - E. Simperl - ESWC SS 2014
 
Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014
Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014
Keynote: Global Media Monitoring - M. Grobelnik - ESWC SS 2014
 
Hands On: Amazon Mechanical Turk - M. Acosta - ESWC SS 2014
Hands On: Amazon Mechanical Turk - M. Acosta - ESWC SS 2014 Hands On: Amazon Mechanical Turk - M. Acosta - ESWC SS 2014
Hands On: Amazon Mechanical Turk - M. Acosta - ESWC SS 2014
 
Tutorial: Querying a Marine Data Warehouse Using SPARQL - I. Fundulaki - ESWC...
Tutorial: Querying a Marine Data Warehouse Using SPARQL - I. Fundulaki - ESWC...Tutorial: Querying a Marine Data Warehouse Using SPARQL - I. Fundulaki - ESWC...
Tutorial: Querying a Marine Data Warehouse Using SPARQL - I. Fundulaki - ESWC...
 
Mon norton tut_publishing01
Mon norton tut_publishing01Mon norton tut_publishing01
Mon norton tut_publishing01
 
Mon domingue introduction to the school
Mon domingue introduction to the schoolMon domingue introduction to the school
Mon domingue introduction to the school
 
Mon norton tut_querying cultural heritage data
Mon norton tut_querying cultural heritage dataMon norton tut_querying cultural heritage data
Mon norton tut_querying cultural heritage data
 
Tue acosta hands_on_providinglinkeddata
Tue acosta hands_on_providinglinkeddataTue acosta hands_on_providinglinkeddata
Tue acosta hands_on_providinglinkeddata
 
Thu bernstein key_warp_speed
Thu bernstein key_warp_speedThu bernstein key_warp_speed
Thu bernstein key_warp_speed
 
Fri schreiber key_knowledge engineering
Fri schreiber key_knowledge engineeringFri schreiber key_knowledge engineering
Fri schreiber key_knowledge engineering
 
Mon norton tut_queryinglinkeddata02
Mon norton tut_queryinglinkeddata02Mon norton tut_queryinglinkeddata02
Mon norton tut_queryinglinkeddata02
 
Mon fundulaki tut_querying linked data
Mon fundulaki tut_querying linked dataMon fundulaki tut_querying linked data
Mon fundulaki tut_querying linked data
 

Recently uploaded

QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 
Tecnogravura, Cylinder Engraving for Rotogravure
Tecnogravura, Cylinder Engraving for RotogravureTecnogravura, Cylinder Engraving for Rotogravure
Tecnogravura, Cylinder Engraving for RotogravureAntonio de Llamas
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Mark Goldstein
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Dynamical Context introduction word sensibility orientation
Dynamical Context introduction word sensibility orientationDynamical Context introduction word sensibility orientation
Dynamical Context introduction word sensibility orientationBuild Intuit
 
Arti Languages Pre Seed Pitchdeck 2024.pdf
Arti Languages Pre Seed Pitchdeck 2024.pdfArti Languages Pre Seed Pitchdeck 2024.pdf
Arti Languages Pre Seed Pitchdeck 2024.pdfwill854175
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...BookNet Canada
 
Women in Automation 2024: Career session - explore career paths in automation
Women in Automation 2024: Career session - explore career paths in automationWomen in Automation 2024: Career session - explore career paths in automation
Women in Automation 2024: Career session - explore career paths in automationDianaGray10
 
Bitdefender-CSG-Report-creat7534-interactive
Bitdefender-CSG-Report-creat7534-interactiveBitdefender-CSG-Report-creat7534-interactive
Bitdefender-CSG-Report-creat7534-interactivestartupro
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
Efficiencies in RPA with UiPath and CyberArk Technologies - Session 2
Efficiencies in RPA with UiPath and CyberArk Technologies - Session 2Efficiencies in RPA with UiPath and CyberArk Technologies - Session 2
Efficiencies in RPA with UiPath and CyberArk Technologies - Session 2DianaGray10
 
full stack practical assignment msc cs.pdf
full stack practical assignment msc cs.pdffull stack practical assignment msc cs.pdf
full stack practical assignment msc cs.pdfHulkTheDevil
 
QMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfQMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfROWELL MARQUINA
 

Recently uploaded (20)

QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Tecnogravura, Cylinder Engraving for Rotogravure
Tecnogravura, Cylinder Engraving for RotogravureTecnogravura, Cylinder Engraving for Rotogravure
Tecnogravura, Cylinder Engraving for Rotogravure
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
BoSEU24 | Bill Thompson | Talk From Another Century
BoSEU24 | Bill Thompson | Talk From Another CenturyBoSEU24 | Bill Thompson | Talk From Another Century
BoSEU24 | Bill Thompson | Talk From Another Century
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Dynamical Context introduction word sensibility orientation
Dynamical Context introduction word sensibility orientationDynamical Context introduction word sensibility orientation
Dynamical Context introduction word sensibility orientation
 
Arti Languages Pre Seed Pitchdeck 2024.pdf
Arti Languages Pre Seed Pitchdeck 2024.pdfArti Languages Pre Seed Pitchdeck 2024.pdf
Arti Languages Pre Seed Pitchdeck 2024.pdf
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
Transcript: New from BookNet Canada for 2024: BNC SalesData and LibraryData -...
 
Women in Automation 2024: Career session - explore career paths in automation
Women in Automation 2024: Career session - explore career paths in automationWomen in Automation 2024: Career session - explore career paths in automation
Women in Automation 2024: Career session - explore career paths in automation
 
Bitdefender-CSG-Report-creat7534-interactive
Bitdefender-CSG-Report-creat7534-interactiveBitdefender-CSG-Report-creat7534-interactive
Bitdefender-CSG-Report-creat7534-interactive
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Efficiencies in RPA with UiPath and CyberArk Technologies - Session 2
Efficiencies in RPA with UiPath and CyberArk Technologies - Session 2Efficiencies in RPA with UiPath and CyberArk Technologies - Session 2
Efficiencies in RPA with UiPath and CyberArk Technologies - Session 2
 
full stack practical assignment msc cs.pdf
full stack practical assignment msc cs.pdffull stack practical assignment msc cs.pdf
full stack practical assignment msc cs.pdf
 
QMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfQMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdf
 

ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial

  • 1. Marko Grobelnik marko.grobelnik@ijs.si Jozef Stefan Institute Kalamaki, May 25th 2012
  • 2.  Introduction ◦ What is Big data? ◦ Why Big-Data? ◦ When Big-Data is really a problem?     Techniques Tools Applications Literature
  • 3.
  • 4.
  • 5.   ‘Big-data’ is similar to ‘Small-data’, but bigger …but having data bigger consequently requires different approaches: ◦ techniques, tools, architectures  …with an aim to solve new problems ◦ …and old problems in a better way.
  • 6. From “Understanding Big Data” by IBM
  • 7.
  • 9.  Key enablers for the growth of “Big Data” are: ◦ Increase of storage capacities ◦ Increase of processing power ◦ Availability of data
  • 10.
  • 11.
  • 12.
  • 13.
  • 14.
  • 15.
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.  Where processing is hosted? ◦ Distributed Servers / Cloud (e.g. Amazon EC2)  Where data is stored? ◦ Distributed Storage (e.g. Amazon S3)  What is the programming model? ◦ Distributed Processing (e.g. MapReduce)  How data is stored & indexed? ◦ High-performance schema-free databases (e.g. MongoDB)  What operations are performed on data? ◦ Analytic / Semantic Processing (e.g. R, OWLIM)
  • 21.  Computing and storage are typically hosted transparently on cloud infrastructures ◦ …providing scale, flexibility and high fail-safety  Distributed Servers ◦ Amazon-EC2, Google App Engine, Elastic, Beanstalk, Heroku  Distributed Storage ◦ Amazon-S3, Hadoop Distributed File System
  • 22.  Distributed processing of Big-Data requires nonstandard programming models ◦ …beyond single machines or traditional parallel programming models (like MPI) ◦ …the aim is to simplify complex programming tasks   The most popular programming model is MapReduce approach Implementations of MapReduce ◦ Hadoop (http://hadoop.apache.org/), Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
  • 23.  The key idea of the MapReduce approach: ◦ A target problem needs to be parallelizable ◦ First, the problem gets split into a set of smaller problems (Map step) ◦ Next, smaller problems are solved in a parallel way ◦ Finally, a set of solutions to the smaller problems get synthesized into a solution of the original problem (Reduce step)
  • 24.  NoSQL class of databases have in common: ◦ ◦ ◦ ◦ ◦  To support large amounts of data Have mostly non-SQL interface Operate on distributed infrastructures (e.g. Hadoop) Are based on key-value pairs (no predefined schema) …are flexible and fast Implementations ◦ MongoDB, CouchDB, Cassandra, Redis, BigTable, Hbase, Hypertable, Voldemort, Riak, ZooKeeper…
  • 25.
  • 26.  …when the operations on data are complex: ◦ …e.g. simple counting is not a complex problem ◦ Modeling and reasoning with data of different kinds can get extremely complex  Good news about big-data: ◦ Often, because of vast amount of data, modeling techniques can get simpler (e.g. smart counting can replace complex model-based analytics)… ◦ …as long as we deal with the scale
  • 27.  Research areas (such as IR, KDD, ML, NLP, SemWeb, …) are subcubes within the data cube Usage Quality Context Streaming Scalability
  • 28.   A risk with “Big-Data mining” is that an analyst can “discover” patterns that are meaningless Statisticians call it Bonferroni’s principle: ◦ Roughly, if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap Example taken from: Rajaraman, Ullman: Mining of Massive Datasets
  • 29. Example:  We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day ◦ ◦ ◦ ◦ ◦  109 people being tracked. 1000 days. Each person stays in a hotel 1% of the time (1 day out of 100) Hotels hold 100 people (so 105 hotels). If everyone behaves randomly (i.e., no terrorists) will the data mining detect anything suspicious? Expected number of “suspicious” pairs of people: ◦ 250,000 ◦ … too many combinations to check – we need to have some additional evidence to find “suspicious” pairs of people in some more efficient way Example taken from: Rajaraman, Ullman: Mining of Massive Datasets
  • 30.  Smart sampling of data ◦ …reducing the original data while not losing the statistical properties of data  Finding similar items ◦ …efficient multidimensional indexing  Incremental updating of the models ◦ (vs. building models from scratch) ◦ …crucial for streaming data  Distributed linear algebra ◦ …dealing with large sparse matrices
  • 31.  On the top of the previous ops we perform usual data mining/machine learning/statistics operators: ◦ Supervised learning (classification, regression, …) ◦ Non-supervised learning (clustering, different types of decompositions, …) ◦ …  …we are just more careful which algorithms we choose (typically linear or sub-linear versions)
  • 32.  An excellent overview of the algorithms covering the above issues is the book “Rajaraman, Ullman: Mining of Massive Datasets”
  • 33.
  • 34.  Good recommendations can make a big difference when keeping a user on a web site ◦ …the key is how rich the context model a system is using to select information for a user ◦ Bad recommendations <1% users, good ones >5% users click ◦ 200clicks/sec Contextual personalized recommendations generated in ~20ms
  • 35.     Domain Sub-domain Page URL URL sub-directories       Page Meta Tags Page Title Page Content Named Entities    Has Query Referrer Query   GeoIP Country GeoIP State GeoIP City            Referring Domain Referring URL Outgoing URL    Absolute Date Day of the Week Day period Hour of the day User Agent Zip Code State Income Age Gender Country Job Title Job Industry
  • 36. Trend Detection System Log Files (~100M page clicks per day) Stream of clicks User profiles Stream of profiles Trends and updated segments NYT articles Segment Keywords Stock Market diabetes, heart disease, disease, heart, illness Green Energy Hybrid cars, energy, power, model, carbonated, fuel, bulbs, Hybrid cars Hybrid cars, vehicles, model, engines, diesel Travel travel, wine, opening, tickets, hotel, sites, cars, search, restaurant … … Segments Stock Market, mortgage, banking, investors, Wall Street, turmoil, New York Stock Exchange Health Sales $ Campaign to sell segments Advertisers
  • 37.     50Gb of uncompressed log files 50-100M clicks 4-6M unique users 7000 unique pages with more then 100 hits
  • 38. Alarms Server Telecom Network (~25 000 devices)  Alarms ~10-100/sec Live feed of data Alarms Explorer Server Alarms Explorer Server implements three real-time scenarios on the alarms stream: 1. Root-Cause-Analysis – finding which device is responsible for occasional “flood” of alarms 2. Short-Term Fault Prediction – predict which device will fail in next 15mins 3. Long-Term Anomaly Detection – detect unusual trends in the network  …system is used in British Telecom Operator Big board display
  • 39.  The aim of the project is to collect and analyze most of the main-stream media across the world ◦ …from 35,000 publishers (180K RSS feeds) crawled in real time (few ~10articles per second) ◦ …each article document gets extracted, cleaned, semantically annotated, structure extracted  Challenges are in terms of complexity of processing and querying of the extracted data ◦ …and matching textual information across the languages (cross-lingual technologies) http://render-project.eu/ http://www.xlike.org/
  • 41. Plain text Extracted graph of triples from text Text Enrichment “Enrycher” is available as as a web-service generating Semantic Graph, LOD links, Entities, Keywords, Categories, Text Summarization
  • 42.  The aim is to use analytic techniques to visualize documents in different ways: ◦ Topic view ◦ Social view ◦ Temporal view
  • 45. US Elections Query US Budget Result set Topic Trends Visualization NATO-Russia Mid-East conflict Topics description
  • 50.   Observe social and communication phenomena at a planetary scale Largest social network analyzed till 2010 Research questions:  How does communication change with user demographics (age, sex, language, country)?  How does geography affect communication?  What is the structure of the communication network? “Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008 50
  • 51.     We collected the data for June 2006 Log size: 150Gb/day (compressed) Total: 1 month of communication data: 4.5Tb of compressed data Activity over June 2006 (30 days) ◦ ◦ ◦ ◦ ◦ 245 million users logged in 180 million users engaged in conversations 17,5 million new accounts activated More than 30 billion conversations More than 255 billion exchanged messages “Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008 51
  • 52. “Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008 52
  • 53. “Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008 53
  • 54.  Count the number of users logging in from particular location on the earth “Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008 54
  • 55.  Logins from Europe “Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008 55
  • 56. Hops 1 396 4 8648 5 3299252 6 28395849 7 79059497 8 52995778 9 10321008 10 1955007 11 518410 12 149945 13 44616 14 13740 15 4476 16 1542 17 536 18 167 19  78 3  10 2  Nodes 71 20 6 degrees of separation [Milgram ’60s] 21 Average distance between two random users is 6.622 23 90% of nodes can be reached in < 8 hops 29 16 10 3 24 “Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008 2 25 3
  • 57.
  • 58.  Big-Data is everywhere, we are just not used to deal with it  The “Big-Data” hype is very recent  Can we do “Big-Data” without big investment? ◦ …growth seems to be going up ◦ …evident lack of experts to build Big-Data apps ◦ …yes – many open source tools, computing machinery is cheap (to buy or to rent) ◦ …the key is knowledge on how to deal with data ◦ …data is either free (e.g. Wikipedia) or to buy (e.g. twitter)