SlideShare a Scribd company logo
Marko Grobelnik
marko.grobelnik@ijs.si
 Jozef Stefan Institute




                     Kalamaki, May 25th 2012
   Introduction
    ◦ What is Big data?
    ◦ Why Big-Data?
    ◦ When Big-Data is really a problem?
   Techniques
   Tools
   Applications
   Literature
   ‘Big-data’ is similar to ‘Small-data’, but bigger

   …but having data bigger consequently requires
    different approaches:
    ◦ techniques, tools, architectures


   …with an aim to solve new problems
    ◦ …and old problems in a better way.
From “Understanding Big Data” by IBM
Big-Data
   Key enablers for the growth of “Big Data” are:

    ◦ Increase of storage capacities

    ◦ Increase of processing power

    ◦ Availability of data
   Where processing is hosted?
    ◦ Distributed Servers / Cloud (e.g. Amazon EC2)
   Where data is stored?
    ◦ Distributed Storage (e.g. Amazon S3)
   What is the programming model?
    ◦ Distributed Processing (e.g. MapReduce)
   How data is stored & indexed?
    ◦ High-performance schema-free databases (e.g.
      MongoDB)
   What operations are performed on data?
    ◦ Analytic / Semantic Processing (e.g. R, OWLIM)
   Computing and storage are typically hosted
    transparently on cloud infrastructures
    ◦ …providing scale, flexibility and high fail-safety


   Distributed Servers
    ◦ Amazon-EC2, Google App Engine, Elastic,
      Beanstalk, Heroku
   Distributed Storage
    ◦ Amazon-S3, Hadoop Distributed File System
   Distributed processing of Big-Data requires non-
    standard programming models
    ◦ …beyond single machines or traditional parallel
      programming models (like MPI)
    ◦ …the aim is to simplify complex programming tasks

   The most popular programming model is
    MapReduce approach

   Implementations of MapReduce
    ◦ Hadoop (http://hadoop.apache.org/), Hive, Pig,
      Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu,
      Flume, Kafka, Azkaban, Oozie, Greenplum
   The key idea of the MapReduce approach:
    ◦ A target problem needs to be parallelizable

    ◦ First, the problem gets split into a set of smaller problems (Map step)
    ◦ Next, smaller problems are solved in a parallel way
    ◦ Finally, a set of solutions to the smaller problems get synthesized
      into a solution of the original problem (Reduce step)
   NoSQL class of databases have in common:
    ◦   To support large amounts of data
    ◦   Have mostly non-SQL interface
    ◦   Operate on distributed infrastructures (e.g. Hadoop)
    ◦   Are based on key-value pairs (no predefined schema)
    ◦   …are flexible and fast
   Implementations
    ◦ MongoDB, CouchDB, Cassandra, Redis, BigTable, Hbase,
      Hypertable, Voldemort, Riak, ZooKeeper…
   …when the operations on data are complex:
    ◦ …e.g. simple counting is not a complex problem
    ◦ Modeling and reasoning with data of different kinds
      can get extremely complex

   Good news about big-data:
    ◦ Often, because of vast amount of data, modeling
      techniques can get simpler (e.g. smart counting can
      replace complex model-based analytics)…
    ◦ …as long as we deal with the scale
   Research areas (such
    as IR, KDD, ML, NLP,
                            Usage
    SemWeb, …) are sub-
    cubes within the data   Quality
    cube
                            Context

                            Streaming

                            Scalability
   A risk with “Big-Data mining” is that an
    analyst can “discover” patterns that are
    meaningless
   Statisticians call it Bonferroni’s principle:
    ◦ Roughly, if you look in more places for interesting
      patterns than your amount of data will support, you
      are bound to find crap




                    Example taken from: Rajaraman, Ullman: Mining of Massive Datasets
Example:
 We want to find (unrelated) people who at least twice
  have stayed at the same hotel on the same day
    ◦   109 people being tracked.
    ◦   1000 days.
    ◦   Each person stays in a hotel 1% of the time (1 day out of 100)
    ◦   Hotels hold 100 people (so 105 hotels).
    ◦   If everyone behaves randomly (i.e., no terrorists) will the data
        mining detect anything suspicious?
   Expected number of “suspicious” pairs of people:
    ◦ 250,000
    ◦ … too many combinations to check – we need to have some
      additional evidence to find “suspicious” pairs of people in
      some more efficient way


                           Example taken from: Rajaraman, Ullman: Mining of Massive Datasets
   Smart sampling of data
    ◦ …reducing the original data while not losing the
      statistical properties of data
   Finding similar items
    ◦ …efficient multidimensional indexing
   Incremental updating of the models
    ◦ (vs. building models from scratch)
    ◦ …crucial for streaming data
   Distributed linear algebra
    ◦ …dealing with large sparse matrices
   On the top of the previous ops we perform
    usual data mining/machine learning/statistics
    operators:
    ◦ Supervised learning (classification, regression, …)
    ◦ Non-supervised learning (clustering, different types
      of decompositions, …)
    ◦ …


   …we are just more careful which algorithms
    we choose (typically linear or sub-linear
    versions)
   An excellent overview of the algorithms
    covering the above issues is the book
    “Rajaraman, Ullman: Mining of Massive
    Datasets”
   Good recommendations
    can make a big
    difference when keeping
    a user on a web site
    ◦ …the key is how rich the
      context model a system is
      using to select information
      for a user
    ◦ Bad recommendations <1%
      users, good ones >5% users
      click
    ◦ 200clicks/sec

                      Contextual
                     personalized
                  recommendations
                 generated in ~20ms
   Domain                   Referring Domain      Zip Code
   Sub-domain               Referring URL         State
   Page URL                 Outgoing URL          Income
   URL sub-directories                             Age
                             GeoIP Country         Gender
   Page Meta Tags           GeoIP State           Country
   Page Title               GeoIP City            Job Title
   Page Content                                    Job Industry
   Named Entities           Absolute Date
                             Day of the Week
   Has Query                Day period
   Referrer Query           Hour of the day
                             User Agent
Trend Detection System

                            User                   Stream of
 Log Files     Stream
              of clicks    profiles                 profiles
  (~100M
page clicks
 per day)


                                                                                                       Sales
                          Trends and
                          updated segments                                          Segments
                          Segment       Keywords

        NYT               Stock         Stock Market, mortgage, banking,
                          Market        investors, Wall Street, turmoil, New
       articles                         York Stock Exchange
                                                                                                Campaign
                          Health        diabetes, heart disease, disease, heart,
                                        illness                                                   to sell
                                                                                                segments
                                                                                      $
                          Green         Hybrid cars, energy, power, model,
                          Energy        carbonated, fuel, bulbs,

                          Hybrid cars   Hybrid cars, vehicles, model, engines,
                                        diesel

                          Travel        travel, wine, opening, tickets, hotel,
                                        sites, cars, search, restaurant


                                                                                               Advertisers
                          …             …
   50Gb of uncompressed log files
   50-100M clicks
   4-6M unique users
   7000 unique pages with more then 100 hits
Alarms Server

      Telecom
      Network                                                          Alarms
                   Alarms                       Live feed of data      Explorer
     (~25 000     ~10-100/sec
      devices)                                                         Server


   Alarms Explorer Server implements three
    real-time scenarios on the alarms stream:
    1. Root-Cause-Analysis – finding which device is
       responsible for occasional “flood” of alarms
    2. Short-Term Fault Prediction – predict which
       device will fail in next 15mins
    3. Long-Term Anomaly Detection – detect
       unusual trends in the network
   …system is used in British Telecom


                                                       Operator     Big board display
   The aim of the project is to collect and analyze
    most of the main-stream media across the world
    ◦ …from 35,000 publishers (180K RSS feeds) crawled in
      real time (few ~10articles per second)
    ◦ …each article document gets extracted, cleaned,
      semantically annotated, structure extracted

   Challenges are in terms of complexity of
    processing and querying of the extracted data
    ◦ …and matching textual information across the
      languages (cross-lingual technologies)


            http://render-project.eu/   http://www.xlike.org/
http://newsfeed.ijs.si/
Extracted
                         graph
                         of triples
    Plain text           from text




                     Text
                  Enrichment




“Enrycher” is available as
as a web-service generating
Semantic Graph, LOD links,
Entities, Keywords, Categories,
Text Summarization
   The aim is to use analytic techniques to
    visualize documents in different ways:
    ◦ Topic view
    ◦ Social view
    ◦ Temporal view
Query

Search
Results

 Topic Map


Selected
group of news




 Selected
 story
Query




Named
entities
in relation
US Elections
                               US Budget
  Query


 Result set

                    NATO-Russia
Topic Trends
Visualization
                               Mid-East
                               conflict



 Topics
 description
Dec 7th 1941
Apr 6th 1941
June 1944
Query
Conceptual map

Search Point


 Dynamic
 contextual
 ranking based
 on the search
 point
   Observe social and communication
     phenomena at a planetary scale
    Largest social network analyzed till 2010

 Research questions:
  How does communication change with user
   demographics (age, sex, language, country)?
  How does geography affect communication?
  What is the structure of the communication
   network?

“Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008
                                                                                          50
   We collected the data for June 2006
     Log size:
         150Gb/day (compressed)
     Total: 1 month of communication data:
         4.5Tb of compressed data
     Activity over June 2006 (30 days)
      ◦   245 million users logged in
      ◦   180 million users engaged in conversations
      ◦   17,5 million new accounts activated
      ◦   More than 30 billion conversations
      ◦   More than 255 billion exchanged messages
“Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008
                                                                                          51
“Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008   52
“Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008   53
   Count the number of users logging in from
     particular location on the earth
“Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008
                                                                                          54
   Logins from Europe




“Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008   55
Hops     Nodes
                                                                                              1         10
                                                                                              2         78
                                                                                              3        396
                                                                                              4       8648
                                                                                              5     3299252
                                                                                              6    28395849
                                                                                              7    79059497
                                                                                              8    52995778
                                                                                              9    10321008
                                                                                              10    1955007
                                                                                              11    518410
                                                                                              12    149945
                                                                                              13     44616
                                                                                              14     13740
                                                                                              15      4476
                                                                                              16      1542
                                                                                              17       536
                                                                                              18       167
                                                                                              19        71

   6 degrees of separation [Milgram ’60s]          20                                                  29


    Average distance between two random users is 6.622
                                                    21                                                  16
                                                                                                       10
   90% of nodes can be reached in < 8 hops         23                                                   3
                                                                                              24         2
    “Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008   25         3
   Big-Data is everywhere, we are just not used to
    deal with it

   The “Big-Data” hype is very recent
    ◦ …growth seems to be going up
    ◦ …evident lack of experts to build Big-Data apps

   Can we do “Big-Data” without big investment?
    ◦ …yes – many open source tools, computing machinery is
      cheap (to buy or to rent)
    ◦ …the key is knowledge on how to deal with data
    ◦ …data is either free (e.g. Wikipedia) or to buy (e.g.
      twitter)

More Related Content

What's hot

Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013
boorad
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data StackZubair Nabi
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big data
hktripathy
 
BigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTBigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRT
Amrit Chhetri
 
The Evolution of Data Science
The Evolution of Data ScienceThe Evolution of Data Science
The Evolution of Data Science
Kenny Daniel
 
Mongo Internal Training session by Soner Altin
Mongo Internal Training session by Soner AltinMongo Internal Training session by Soner Altin
Mongo Internal Training session by Soner Altin
mustafa sarac
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
Vipin Batra
 
Introduction of big data unit 1
Introduction of big data unit 1Introduction of big data unit 1
Introduction of big data unit 1
RojaT4
 
Intro to Big Data Hadoop
Intro to Big Data HadoopIntro to Big Data Hadoop
Intro to Big Data Hadoop
Apache Apex
 
Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)
SahilRaina21
 
Big Data Analytics(concepts of hadoop mapreduce,mahout,k-means clustering,hbase)
Big Data Analytics(concepts of hadoop mapreduce,mahout,k-means clustering,hbase)Big Data Analytics(concepts of hadoop mapreduce,mahout,k-means clustering,hbase)
Big Data Analytics(concepts of hadoop mapreduce,mahout,k-means clustering,hbase)
MIT College Of Engineering,Pune
 
Big data ppt
Big data pptBig data ppt
Big data ppt
Shweta Sahu
 
Data Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research OpportunitiesData Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research Opportunities
Kathirvel Ayyaswamy
 
Big data technology unit 3
Big data technology unit 3Big data technology unit 3
Big data technology unit 3
RojaT4
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
Philippe Julio
 
Big Data Unit 4 - Hadoop
Big Data Unit 4 - HadoopBig Data Unit 4 - Hadoop
Big Data Unit 4 - Hadoop
RojaT4
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time Systems
Kamalika Dutta
 
Data lake ppt
Data lake pptData lake ppt
Data lake ppt
SwarnaLatha177
 
Big Data: an introduction
Big Data: an introductionBig Data: an introduction
Big Data: an introduction
Bart Vandewoestyne
 
Intro to Data Science Big Data
Intro to Data Science Big DataIntro to Data Science Big Data
Intro to Data Science Big Data
Indu Khemchandani
 

What's hot (20)

Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data Stack
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big data
 
BigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTBigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRT
 
The Evolution of Data Science
The Evolution of Data ScienceThe Evolution of Data Science
The Evolution of Data Science
 
Mongo Internal Training session by Soner Altin
Mongo Internal Training session by Soner AltinMongo Internal Training session by Soner Altin
Mongo Internal Training session by Soner Altin
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Introduction of big data unit 1
Introduction of big data unit 1Introduction of big data unit 1
Introduction of big data unit 1
 
Intro to Big Data Hadoop
Intro to Big Data HadoopIntro to Big Data Hadoop
Intro to Big Data Hadoop
 
Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)
 
Big Data Analytics(concepts of hadoop mapreduce,mahout,k-means clustering,hbase)
Big Data Analytics(concepts of hadoop mapreduce,mahout,k-means clustering,hbase)Big Data Analytics(concepts of hadoop mapreduce,mahout,k-means clustering,hbase)
Big Data Analytics(concepts of hadoop mapreduce,mahout,k-means clustering,hbase)
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Data Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research OpportunitiesData Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research Opportunities
 
Big data technology unit 3
Big data technology unit 3Big data technology unit 3
Big data technology unit 3
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big Data Unit 4 - Hadoop
Big Data Unit 4 - HadoopBig Data Unit 4 - Hadoop
Big Data Unit 4 - Hadoop
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time Systems
 
Data lake ppt
Data lake pptData lake ppt
Data lake ppt
 
Big Data: an introduction
Big Data: an introductionBig Data: an introduction
Big Data: an introduction
 
Intro to Data Science Big Data
Intro to Data Science Big DataIntro to Data Science Big Data
Intro to Data Science Big Data
 

Viewers also liked

Introduction to mining massive datasets
Introduction to mining massive datasetsIntroduction to mining massive datasets
Introduction to mining massive datasets
Viet-Trung TRAN
 
Introduction to hadoop
Introduction to hadoopIntroduction to hadoop
Introduction to hadoop
karthika karthi
 
Python for Data Anaysis第2回勉強会4,5章
Python for Data Anaysis第2回勉強会4,5章Python for Data Anaysis第2回勉強会4,5章
Python for Data Anaysis第2回勉強会4,5章
Makoto Kawano
 
Big Data Analytics with R
Big Data Analytics with RBig Data Analytics with R
Big Data Analytics with R
Great Wide Open
 
Big data analytics using R
Big data analytics using RBig data analytics using R
Big data analytics using R
Karthik Padmanabhan ( MLE℠)
 
Data Analytics using R
Data Analytics using RData Analytics using R
Data Analytics using R
richards9696
 
Intoroduction of Pandas with Python
Intoroduction of Pandas with PythonIntoroduction of Pandas with Python
Intoroduction of Pandas with PythonAtsushi Hayakawa
 
DIGITAL MARKETING FOR BUSINESS GROWTH Workshop Slides
DIGITAL MARKETING FOR BUSINESS GROWTH Workshop SlidesDIGITAL MARKETING FOR BUSINESS GROWTH Workshop Slides
DIGITAL MARKETING FOR BUSINESS GROWTH Workshop SlidesBjarne Viken
 
RHadoop
RHadoopRHadoop
A Workshop on R
A Workshop on RA Workshop on R
A Workshop on R
Ajay Ohri
 
Big Data: tools and techniques for working with large data sets
Big Data: tools and techniques for working with large data setsBig Data: tools and techniques for working with large data sets
Big Data: tools and techniques for working with large data sets
Boston Consulting Group
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
Ajay Ohri
 
Scalable Learning Technologies for Big Data Mining
Scalable Learning Technologies for Big Data MiningScalable Learning Technologies for Big Data Mining
Scalable Learning Technologies for Big Data Mining
Gerard de Melo
 
Statistics for data scientists
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientists
Ajay Ohri
 
Python for R Users
Python for R UsersPython for R Users
Python for R Users
Ajay Ohri
 
Poverty Hedging_Catalysts For Change Zone of Future Innovtion
Poverty Hedging_Catalysts For Change Zone of Future InnovtionPoverty Hedging_Catalysts For Change Zone of Future Innovtion
Poverty Hedging_Catalysts For Change Zone of Future InnovtionInstitute for the Future
 
Adaptive Shelters_Catalysts For Change Zone of Future Innovtion
Adaptive Shelters_Catalysts For Change Zone of Future InnovtionAdaptive Shelters_Catalysts For Change Zone of Future Innovtion
Adaptive Shelters_Catalysts For Change Zone of Future InnovtionInstitute for the Future
 
All Of The Above
All Of The AboveAll Of The Above
All Of The Above
kmurray230
 

Viewers also liked (20)

Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
Introduction to mining massive datasets
Introduction to mining massive datasetsIntroduction to mining massive datasets
Introduction to mining massive datasets
 
Introduction to hadoop
Introduction to hadoopIntroduction to hadoop
Introduction to hadoop
 
Python for Data Anaysis第2回勉強会4,5章
Python for Data Anaysis第2回勉強会4,5章Python for Data Anaysis第2回勉強会4,5章
Python for Data Anaysis第2回勉強会4,5章
 
Big Data Analytics with R
Big Data Analytics with RBig Data Analytics with R
Big Data Analytics with R
 
Big data analytics using R
Big data analytics using RBig data analytics using R
Big data analytics using R
 
Data Analytics using R
Data Analytics using RData Analytics using R
Data Analytics using R
 
Intoroduction of Pandas with Python
Intoroduction of Pandas with PythonIntoroduction of Pandas with Python
Intoroduction of Pandas with Python
 
DIGITAL MARKETING FOR BUSINESS GROWTH Workshop Slides
DIGITAL MARKETING FOR BUSINESS GROWTH Workshop SlidesDIGITAL MARKETING FOR BUSINESS GROWTH Workshop Slides
DIGITAL MARKETING FOR BUSINESS GROWTH Workshop Slides
 
RHadoop
RHadoopRHadoop
RHadoop
 
A Workshop on R
A Workshop on RA Workshop on R
A Workshop on R
 
Big Data: tools and techniques for working with large data sets
Big Data: tools and techniques for working with large data setsBig Data: tools and techniques for working with large data sets
Big Data: tools and techniques for working with large data sets
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
Scalable Learning Technologies for Big Data Mining
Scalable Learning Technologies for Big Data MiningScalable Learning Technologies for Big Data Mining
Scalable Learning Technologies for Big Data Mining
 
Statistics for data scientists
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientists
 
Python for R Users
Python for R UsersPython for R Users
Python for R Users
 
Poverty Hedging_Catalysts For Change Zone of Future Innovtion
Poverty Hedging_Catalysts For Change Zone of Future InnovtionPoverty Hedging_Catalysts For Change Zone of Future Innovtion
Poverty Hedging_Catalysts For Change Zone of Future Innovtion
 
3d Views Portfolio
3d Views Portfolio3d Views Portfolio
3d Views Portfolio
 
Adaptive Shelters_Catalysts For Change Zone of Future Innovtion
Adaptive Shelters_Catalysts For Change Zone of Future InnovtionAdaptive Shelters_Catalysts For Change Zone of Future Innovtion
Adaptive Shelters_Catalysts For Change Zone of Future Innovtion
 
All Of The Above
All Of The AboveAll Of The Above
All Of The Above
 

Similar to Big Data Tutorial - Marko Grobelnik - 25 May 2012

EDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko Grobelnik
European Data Forum
 
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data TutorialESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorialeswcsummerschool
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
heyramzz
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
Aravindharamanan S
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
GV prasad
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
Pragati Singh
 
The information supernova
The information supernovaThe information supernova
The information supernova
Alaa Al-Agamawi
 
Big data-analytics-changing-way-organizations-conducting-business
Big data-analytics-changing-way-organizations-conducting-businessBig data-analytics-changing-way-organizations-conducting-business
Big data-analytics-changing-way-organizations-conducting-business
Amit Bhargava
 
Big data in marketing at harvard business club nick1 june 15 2013
Big data in marketing at harvard business club nick1 june 15 2013Big data in marketing at harvard business club nick1 june 15 2013
Big data in marketing at harvard business club nick1 june 15 2013
nkabra
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2
RojaT4
 
Dma unit 1
Dma unit   1Dma unit   1
Dma unit 1
thamizh arasi
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
Rahul Jain
 
The AI Platform Business Revolution: Matchmaking, Empathetic Technology, and ...
The AI Platform Business Revolution: Matchmaking, Empathetic Technology, and ...The AI Platform Business Revolution: Matchmaking, Empathetic Technology, and ...
The AI Platform Business Revolution: Matchmaking, Empathetic Technology, and ...
Steve Omohundro
 
Internet of Things
Internet of ThingsInternet of Things
Internet of Things
Aniekan Akpaffiong
 
Big Data = Big Decisions
Big Data = Big DecisionsBig Data = Big Decisions
Big Data = Big Decisions
InnoTech
 
Big Data - Umesh Bellur
Big Data - Umesh BellurBig Data - Umesh Bellur
Big Data - Umesh Bellur
STS FORUM 2016
 

Similar to Big Data Tutorial - Marko Grobelnik - 25 May 2012 (20)

EDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko Grobelnik
 
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data TutorialESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
 
Big data tutorial
Big data tutorialBig data tutorial
Big data tutorial
 
The information supernova
The information supernovaThe information supernova
The information supernova
 
Big data-analytics-changing-way-organizations-conducting-business
Big data-analytics-changing-way-organizations-conducting-businessBig data-analytics-changing-way-organizations-conducting-business
Big data-analytics-changing-way-organizations-conducting-business
 
Big data business case
Big data   business caseBig data   business case
Big data business case
 
Big data in marketing at harvard business club nick1 june 15 2013
Big data in marketing at harvard business club nick1 june 15 2013Big data in marketing at harvard business club nick1 june 15 2013
Big data in marketing at harvard business club nick1 june 15 2013
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2
 
Dma unit 1
Dma unit   1Dma unit   1
Dma unit 1
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
Data mining
Data miningData mining
Data mining
 
The AI Platform Business Revolution: Matchmaking, Empathetic Technology, and ...
The AI Platform Business Revolution: Matchmaking, Empathetic Technology, and ...The AI Platform Business Revolution: Matchmaking, Empathetic Technology, and ...
The AI Platform Business Revolution: Matchmaking, Empathetic Technology, and ...
 
Internet of Things
Internet of ThingsInternet of Things
Internet of Things
 
Big Data on AWS
Big Data on AWSBig Data on AWS
Big Data on AWS
 
Big Data = Big Decisions
Big Data = Big DecisionsBig Data = Big Decisions
Big Data = Big Decisions
 
Big Data - Umesh Bellur
Big Data - Umesh BellurBig Data - Umesh Bellur
Big Data - Umesh Bellur
 

Recently uploaded

Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 

Recently uploaded (20)

Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 

Big Data Tutorial - Marko Grobelnik - 25 May 2012

  • 1. Marko Grobelnik marko.grobelnik@ijs.si Jozef Stefan Institute Kalamaki, May 25th 2012
  • 2. Introduction ◦ What is Big data? ◦ Why Big-Data? ◦ When Big-Data is really a problem?  Techniques  Tools  Applications  Literature
  • 3.
  • 4.
  • 5. ‘Big-data’ is similar to ‘Small-data’, but bigger  …but having data bigger consequently requires different approaches: ◦ techniques, tools, architectures  …with an aim to solve new problems ◦ …and old problems in a better way.
  • 6. From “Understanding Big Data” by IBM
  • 7.
  • 9. Key enablers for the growth of “Big Data” are: ◦ Increase of storage capacities ◦ Increase of processing power ◦ Availability of data
  • 10.
  • 11.
  • 12.
  • 13.
  • 14.
  • 15.
  • 16.
  • 17.
  • 18.
  • 19.
  • 20. Where processing is hosted? ◦ Distributed Servers / Cloud (e.g. Amazon EC2)  Where data is stored? ◦ Distributed Storage (e.g. Amazon S3)  What is the programming model? ◦ Distributed Processing (e.g. MapReduce)  How data is stored & indexed? ◦ High-performance schema-free databases (e.g. MongoDB)  What operations are performed on data? ◦ Analytic / Semantic Processing (e.g. R, OWLIM)
  • 21. Computing and storage are typically hosted transparently on cloud infrastructures ◦ …providing scale, flexibility and high fail-safety  Distributed Servers ◦ Amazon-EC2, Google App Engine, Elastic, Beanstalk, Heroku  Distributed Storage ◦ Amazon-S3, Hadoop Distributed File System
  • 22. Distributed processing of Big-Data requires non- standard programming models ◦ …beyond single machines or traditional parallel programming models (like MPI) ◦ …the aim is to simplify complex programming tasks  The most popular programming model is MapReduce approach  Implementations of MapReduce ◦ Hadoop (http://hadoop.apache.org/), Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
  • 23. The key idea of the MapReduce approach: ◦ A target problem needs to be parallelizable ◦ First, the problem gets split into a set of smaller problems (Map step) ◦ Next, smaller problems are solved in a parallel way ◦ Finally, a set of solutions to the smaller problems get synthesized into a solution of the original problem (Reduce step)
  • 24. NoSQL class of databases have in common: ◦ To support large amounts of data ◦ Have mostly non-SQL interface ◦ Operate on distributed infrastructures (e.g. Hadoop) ◦ Are based on key-value pairs (no predefined schema) ◦ …are flexible and fast  Implementations ◦ MongoDB, CouchDB, Cassandra, Redis, BigTable, Hbase, Hypertable, Voldemort, Riak, ZooKeeper…
  • 25.
  • 26. …when the operations on data are complex: ◦ …e.g. simple counting is not a complex problem ◦ Modeling and reasoning with data of different kinds can get extremely complex  Good news about big-data: ◦ Often, because of vast amount of data, modeling techniques can get simpler (e.g. smart counting can replace complex model-based analytics)… ◦ …as long as we deal with the scale
  • 27. Research areas (such as IR, KDD, ML, NLP, Usage SemWeb, …) are sub- cubes within the data Quality cube Context Streaming Scalability
  • 28. A risk with “Big-Data mining” is that an analyst can “discover” patterns that are meaningless  Statisticians call it Bonferroni’s principle: ◦ Roughly, if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap Example taken from: Rajaraman, Ullman: Mining of Massive Datasets
  • 29. Example:  We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day ◦ 109 people being tracked. ◦ 1000 days. ◦ Each person stays in a hotel 1% of the time (1 day out of 100) ◦ Hotels hold 100 people (so 105 hotels). ◦ If everyone behaves randomly (i.e., no terrorists) will the data mining detect anything suspicious?  Expected number of “suspicious” pairs of people: ◦ 250,000 ◦ … too many combinations to check – we need to have some additional evidence to find “suspicious” pairs of people in some more efficient way Example taken from: Rajaraman, Ullman: Mining of Massive Datasets
  • 30. Smart sampling of data ◦ …reducing the original data while not losing the statistical properties of data  Finding similar items ◦ …efficient multidimensional indexing  Incremental updating of the models ◦ (vs. building models from scratch) ◦ …crucial for streaming data  Distributed linear algebra ◦ …dealing with large sparse matrices
  • 31. On the top of the previous ops we perform usual data mining/machine learning/statistics operators: ◦ Supervised learning (classification, regression, …) ◦ Non-supervised learning (clustering, different types of decompositions, …) ◦ …  …we are just more careful which algorithms we choose (typically linear or sub-linear versions)
  • 32. An excellent overview of the algorithms covering the above issues is the book “Rajaraman, Ullman: Mining of Massive Datasets”
  • 33.
  • 34. Good recommendations can make a big difference when keeping a user on a web site ◦ …the key is how rich the context model a system is using to select information for a user ◦ Bad recommendations <1% users, good ones >5% users click ◦ 200clicks/sec Contextual personalized recommendations generated in ~20ms
  • 35. Domain  Referring Domain  Zip Code  Sub-domain  Referring URL  State  Page URL  Outgoing URL  Income  URL sub-directories  Age  GeoIP Country  Gender  Page Meta Tags  GeoIP State  Country  Page Title  GeoIP City  Job Title  Page Content  Job Industry  Named Entities  Absolute Date  Day of the Week  Has Query  Day period  Referrer Query  Hour of the day  User Agent
  • 36. Trend Detection System User Stream of Log Files Stream of clicks profiles profiles (~100M page clicks per day) Sales Trends and updated segments Segments Segment Keywords NYT Stock Stock Market, mortgage, banking, Market investors, Wall Street, turmoil, New articles York Stock Exchange Campaign Health diabetes, heart disease, disease, heart, illness to sell segments $ Green Hybrid cars, energy, power, model, Energy carbonated, fuel, bulbs, Hybrid cars Hybrid cars, vehicles, model, engines, diesel Travel travel, wine, opening, tickets, hotel, sites, cars, search, restaurant Advertisers … …
  • 37. 50Gb of uncompressed log files  50-100M clicks  4-6M unique users  7000 unique pages with more then 100 hits
  • 38. Alarms Server Telecom Network Alarms Alarms Live feed of data Explorer (~25 000 ~10-100/sec devices) Server  Alarms Explorer Server implements three real-time scenarios on the alarms stream: 1. Root-Cause-Analysis – finding which device is responsible for occasional “flood” of alarms 2. Short-Term Fault Prediction – predict which device will fail in next 15mins 3. Long-Term Anomaly Detection – detect unusual trends in the network  …system is used in British Telecom Operator Big board display
  • 39. The aim of the project is to collect and analyze most of the main-stream media across the world ◦ …from 35,000 publishers (180K RSS feeds) crawled in real time (few ~10articles per second) ◦ …each article document gets extracted, cleaned, semantically annotated, structure extracted  Challenges are in terms of complexity of processing and querying of the extracted data ◦ …and matching textual information across the languages (cross-lingual technologies) http://render-project.eu/ http://www.xlike.org/
  • 41. Extracted graph of triples Plain text from text Text Enrichment “Enrycher” is available as as a web-service generating Semantic Graph, LOD links, Entities, Keywords, Categories, Text Summarization
  • 42. The aim is to use analytic techniques to visualize documents in different ways: ◦ Topic view ◦ Social view ◦ Temporal view
  • 45. US Elections US Budget Query Result set NATO-Russia Topic Trends Visualization Mid-East conflict Topics description
  • 49. Query Conceptual map Search Point Dynamic contextual ranking based on the search point
  • 50. Observe social and communication phenomena at a planetary scale  Largest social network analyzed till 2010 Research questions:  How does communication change with user demographics (age, sex, language, country)?  How does geography affect communication?  What is the structure of the communication network? “Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008 50
  • 51. We collected the data for June 2006  Log size: 150Gb/day (compressed)  Total: 1 month of communication data: 4.5Tb of compressed data  Activity over June 2006 (30 days) ◦ 245 million users logged in ◦ 180 million users engaged in conversations ◦ 17,5 million new accounts activated ◦ More than 30 billion conversations ◦ More than 255 billion exchanged messages “Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008 51
  • 52. “Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008 52
  • 53. “Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008 53
  • 54. Count the number of users logging in from particular location on the earth “Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008 54
  • 55. Logins from Europe “Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008 55
  • 56. Hops Nodes 1 10 2 78 3 396 4 8648 5 3299252 6 28395849 7 79059497 8 52995778 9 10321008 10 1955007 11 518410 12 149945 13 44616 14 13740 15 4476 16 1542 17 536 18 167 19 71  6 degrees of separation [Milgram ’60s] 20 29 Average distance between two random users is 6.622 21 16  10  90% of nodes can be reached in < 8 hops 23 3 24 2 “Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008 25 3
  • 57.
  • 58. Big-Data is everywhere, we are just not used to deal with it  The “Big-Data” hype is very recent ◦ …growth seems to be going up ◦ …evident lack of experts to build Big-Data apps  Can we do “Big-Data” without big investment? ◦ …yes – many open source tools, computing machinery is cheap (to buy or to rent) ◦ …the key is knowledge on how to deal with data ◦ …data is either free (e.g. Wikipedia) or to buy (e.g. twitter)