Big Data Tutorial
Marko Grobelnik
marko.grobelnik@ijs.si
Jozef Stefan Institute
Kalamaki, May 25th 2012
   Introduction
    ◦ What is Big data?
    ◦ Why Big-Data?
    ◦ When is Big-Data really a problem?
   Techniques
   Tools
   Applications
   Literature
   ‘Big-data’ is similar to ‘Small-data’, but bigger

   …but bigger data consequently requires
    different approaches:
    ◦ techniques, tools, architectures


   …with an aim to solve new problems
    ◦ …and old problems in a better way.
From “Understanding Big Data” by IBM
Big-Data
   Key enablers for the growth of “Big Data” are:

    ◦ Increase of storage capacities

    ◦ Increase of processing power

    ◦ Availability of data
   Where is processing hosted?
    ◦ Distributed Servers / Cloud (e.g. Amazon EC2)
   Where is data stored?
    ◦ Distributed Storage (e.g. Amazon S3)
   What is the programming model?
    ◦ Distributed Processing (e.g. MapReduce)
   How is data stored & indexed?
    ◦ High-performance schema-free databases (e.g. MongoDB)
   What operations are performed on data?
    ◦ Analytic / Semantic Processing (e.g. R, OWLIM)
   Computing and storage are typically hosted
    transparently on cloud infrastructures
    ◦ …providing scale, flexibility and high fail-safety


   Distributed Servers
    ◦ Amazon EC2, Google App Engine, Elastic Beanstalk, Heroku
   Distributed Storage
    ◦ Amazon S3, Hadoop Distributed File System
   Distributed processing of Big-Data requires non-
    standard programming models
    ◦ …beyond single machines or traditional parallel
      programming models (like MPI)
    ◦ …the aim is to simplify complex programming tasks

   The most popular programming model is
    the MapReduce approach

   Implementations of MapReduce and related ecosystem tools
    ◦ Hadoop (http://hadoop.apache.org/), Hive, Pig,
      Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu,
      Flume, Kafka, Azkaban, Oozie, Greenplum
   The key idea of the MapReduce approach:
    ◦ A target problem needs to be parallelizable

    ◦ First, the problem gets split into a set of smaller problems (Map step)
    ◦ Next, smaller problems are solved in a parallel way
    ◦ Finally, the set of solutions to the smaller problems gets synthesized
      into a solution of the original problem (Reduce step)
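
The split / solve / combine steps above can be illustrated with a minimal single-process sketch in Python. Word counting is used here only as a stand-in problem; the function names (`map_step`, `shuffle`, `reduce_step`, `word_count`) are hypothetical and not taken from Hadoop or any other framework, and the grouping step between Map and Reduce is an assumption about how partial results are brought together.

```python
from collections import defaultdict
from functools import reduce


def map_step(chunk):
    """Map: solve one small sub-problem, emitting (key, value) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group values by key so that each key can be reduced independently."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_step(values):
    """Reduce: combine the partial results collected for one key."""
    return reduce(lambda a, b: a + b, values)


def word_count(chunks):
    # On a real cluster each chunk would be mapped on a different machine;
    # in this sketch the map calls simply run one after another.
    mapped = [pair for chunk in chunks for pair in map_step(chunk)]
    return {key: reduce_step(values) for key, values in shuffle(mapped).items()}


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

Because each chunk is mapped without looking at any other chunk, the map calls could in principle run in parallel, which is the property the first bullet asks of the target problem.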
   Databases in the NoSQL class have in common that they:
    ◦   Support large amounts of data
    ◦   Have a mostly non-SQL interface
    ◦   Operate on distributed infrastructures (e.g. Hadoop)
    ◦   Are based on key-value pairs (no predefined schema)
    ◦   …are flexible and fast
   Implementations
    ◦ MongoDB, CouchDB, Cassandra, Redis, BigTable, Hbase,
      Hypertable, Voldemort, Riak, ZooKeeper…
   …when the operations on data are complex:
    ◦ …e.g. simple counting is not a complex problem
    ◦ Modeling and reasoning with data of different kinds
      can get extremely complex

   Good news about big-data:
    ◦ Often, because of the vast amount of data, modeling
      techniques can get simpler (e.g. smart counting can
      replace complex model-based analytics)…
    ◦ …as long as we deal with the scale
   Research areas (such as IR, KDD, ML, NLP, SemWeb, …) are sub-cubes within the data cube
    ◦ Data cube dimensions shown in the figure: Usage, Quality, Context, Streaming, Scalability
   A risk with “Big-Data mining” is that an
    analyst can “discover” patterns that are
    meaningless
   Statisticians call it Bonferroni’s principle:
    ◦ Roughly, if you look in more places for interesting
      patterns than your amount of data will support, you
      are bound to find crap




                    Example taken from: Rajaraman, Ullman: Mining of Massive Datasets
Example:
 We want to find (unrelated) people who at least twice
  have stayed at the same hotel on the same day
    ◦   10^9 people being tracked.
    ◦   1000 days.
    ◦   Each person stays in a hotel 1% of the time (1 day out of 100)
    ◦   Hotels hold 100 people (so 10^5 hotels).
    ◦   If everyone behaves randomly (i.e., no terrorists) will the data
        mining detect anything suspicious?
   Expected number of “suspicious” pairs of people:
    ◦ 250,000
    ◦ … too many combinations to check – we need to have some
      additional evidence to find “suspicious” pairs of people in
      some more efficient way


                           Example taken from: Rajaraman, Ullman: Mining of Massive Datasets
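
The 250,000 figure can be reproduced with a few lines of arithmetic. The sketch below assumes, as the example does, that every person picks days and hotels independently and uniformly at random; the variable names are illustrative only. It prints both the rounded estimate (using n²/2 for the number of pairs) and the value obtained with exact binomial coefficients.

```python
from math import comb

people = 10**9        # people being tracked
days = 1000           # days of observation
hotels = 10**5        # number of hotels
p_in_hotel = 0.01     # chance that a person is in some hotel on a given day

# Chance that two given people are in the SAME hotel on one given day:
# both must be in a hotel, and the second must pick the first one's hotel.
p_same_hotel_one_day = p_in_hotel * p_in_hotel / hotels

# Chance that this happens on two given (different) days.
p_same_hotel_two_days = p_same_hotel_one_day ** 2

# Rounded estimate: n^2 / 2 pairs of people and d^2 / 2 pairs of days.
approx_pairs = (people**2 / 2) * (days**2 / 2) * p_same_hotel_two_days

# Same computation with exact binomial coefficients.
exact_pairs = comb(people, 2) * comb(days, 2) * p_same_hotel_two_days

print(f"rounded estimate: {approx_pairs:,.0f}")
print(f"exact binomials:  {exact_pairs:,.0f}")
```

Either way, purely random behaviour yields roughly a quarter of a million "suspicious" pairs, which is why the pattern alone carries no evidence.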
   Smart sampling of data
    ◦ …reducing the original data while not losing the
      statistical properties of data
   Finding similar items
    ◦ …efficient multidimensional indexing
   Incremental updating of the models
    ◦ (vs. building models from scratch)
    ◦ …crucial for streaming data
   Distributed linear algebra
    ◦ …dealing with large sparse matrices
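
As one small illustration of incremental updating, the sketch below keeps a running mean and variance that are updated one observation at a time, so the "model" never has to be rebuilt from the full history. It uses Welford's online algorithm, which is a choice made here for illustration and is not named in the slides; the class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """Running mean and sample variance, updated one value at a time."""

    def __init__(self):
        self.n = 0          # number of values seen so far
        self.mean = 0.0     # current mean
        self._m2 = 0.0      # sum of squared deviations from the current mean

    def update(self, x):
        """Fold one new observation into the statistics in O(1) time."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two values have been seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)   # in a streaming setting, values arrive over time

print(stats.n, stats.mean, stats.variance)
```

Memory use stays constant regardless of how many values have been seen, which is what makes this style of update suitable for streams.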
   On top of the previous operations we apply the
    usual data mining / machine learning / statistics
    operators:
    ◦ Supervised learning (classification, regression, …)
    ◦ Unsupervised learning (clustering, different types
      of decompositions, …)
    ◦ …


   …we are just more careful about which algorithms
    we choose (typically linear or sub-linear
    versions)
   An excellent overview of the algorithms
    covering the above issues is the book
    “Rajaraman, Ullman: Mining of Massive
    Datasets”
   Good recommendations can make a big difference when keeping a user on a web site
    ◦ …the key is how rich a context model the system uses to select information for a user
    ◦ Bad recommendations get clicks from <1% of users, good ones from >5% of users
    ◦ 200 clicks/sec
    ◦ Contextual personalized recommendations generated in ~20ms
   Domain                   Referring Domain      Zip Code
   Sub-domain               Referring URL         State
   Page URL                 Outgoing URL          Income
   URL sub-directories                             Age
                             GeoIP Country         Gender
   Page Meta Tags           GeoIP State           Country
   Page Title               GeoIP City            Job Title
   Page Content                                    Job Industry
   Named Entities           Absolute Date
                             Day of the Week
   Has Query                Day period
   Referrer Query           Hour of the day
                             User Agent
Trend Detection System

   Diagram components: Log Files (~100M page clicks per day), Stream of clicks, User profiles, Stream of profiles, NYT articles, Trends and updated segments, Segments, Sales, Campaign to sell segments ($), Advertisers

   Segment        Keywords
   Stock Market   Stock Market, mortgage, banking, investors, Wall Street, turmoil, New York Stock Exchange
   Health         diabetes, heart disease, disease, heart, illness
   Green Energy   Hybrid cars, energy, power, model, carbonated, fuel, bulbs,
   Hybrid cars    Hybrid cars, vehicles, model, engines, diesel
   Travel         travel, wine, opening, tickets, hotel, sites, cars, search, restaurant
   …              …
   50 GB of uncompressed log files
   50-100M clicks
   4-6M unique users
   7000 unique pages with more than 100 hits
Alarms Server

   Diagram components: Telecom Network (~25 000 devices), Alarms (~10-100/sec), Alarms Server, Live feed of data, Alarms Explorer Server

   Alarms Explorer Server implements three
    real-time scenarios on the alarms stream:
    1. Root-Cause-Analysis – finding which device is
       responsible for an occasional “flood” of alarms
    2. Short-Term Fault Prediction – predicting which
       device will fail in the next 15 minutes
    3. Long-Term Anomaly Detection – detecting
       unusual trends in the network
   …the system is used at British Telecom
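
To make the "flood" scenario more concrete, here is a deliberately simple sketch that counts alarms per device inside a sliding time window and reports devices that reach a threshold. It is only a counting baseline and makes no claim about how the Alarms Explorer Server works; the class name `FloodDetector`, the window length, the threshold and the device identifiers are all hypothetical.

```python
from collections import Counter, deque


class FloodDetector:
    """Counts alarms per device inside a sliding time window."""

    def __init__(self, window_seconds=60.0, threshold=3):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = deque()     # (timestamp, device_id), oldest first
        self._counts = Counter()   # alarms per device inside the window

    def add(self, timestamp, device_id):
        """Register one alarm; return devices at or above the threshold."""
        self._events.append((timestamp, device_id))
        self._counts[device_id] += 1
        # Drop alarms that have fallen out of the window.
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_device = self._events.popleft()
            self._counts[old_device] -= 1
        return sorted(d for d, c in self._counts.items() if c >= self.threshold)


detector = FloodDetector(window_seconds=60.0, threshold=3)
alarms = [(0, "dev-A"), (10, "dev-B"), (20, "dev-A"), (30, "dev-A"), (100, "dev-B")]
flagged = [detector.add(t, d) for t, d in alarms]
print(flagged)
```

Each alarm is appended once and evicted once, so the cost per alarm stays small even at the ~10-100 alarms per second mentioned for the stream.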


   Diagram labels: Operator, Big board display
   The aim of the project is to collect and analyze
    most of the mainstream media across the world
    ◦ …from 35,000 publishers (180K RSS feeds) crawled in
      real time (~10 articles per second)
    ◦ …each article is extracted, cleaned and
      semantically annotated, and its structure is extracted

   The challenges lie in the complexity of
    processing and querying the extracted data
    ◦ …and in matching textual information across
      languages (cross-lingual technologies)


            http://render-project.eu/   http://www.xlike.org/
http://newsfeed.ijs.si/
Diagram: Plain text, Text Enrichment, Extracted graph of triples from text

“Enrycher” is available as a web-service generating
Semantic Graph, LOD links,
Entities, Keywords, Categories,
Text Summarization
   The aim is to use analytic techniques to
    visualize documents in different ways:
    ◦ Topic view
    ◦ Social view
    ◦ Temporal view
Screenshot labels: Query, Search Results, Topic Map, Selected group of news, Selected story
Screenshot labels: Query, Named entities in relation
Screenshot labels: US Elections, US Budget, Query, Result set, NATO-Russia, Topic Trends Visualization, Mid-East conflict, Topics description
Screenshot labels: Dec 7th 1941, Apr 6th 1941, June 1944
Screenshot labels: Query, Conceptual map, Search Point, Dynamic contextual ranking based on the search point
   Observe social and communication
     phenomena at a planetary scale
    Largest social network analyzed till 2010

 Research questions:
  How does communication change with user
   demographics (age, sex, language, country)?
  How does geography affect communication?
  What is the structure of the communication
   network?

“Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008
   We collected the data for June 2006
     Log size:
         150 GB/day (compressed)
     Total: 1 month of communication data:
         4.5 TB of compressed data
     Activity over June 2006 (30 days)
      ◦   245 million users logged in
      ◦   180 million users engaged in conversations
      ◦   17.5 million new accounts activated
      ◦   More than 30 billion conversations
      ◦   More than 255 billion exchanged messages
   Count the number of users logging in from
     particular location on the earth
   Logins from Europe




   6 degrees of separation [Milgram ’60s]
    ◦ Average distance between two random users is 6.622
   90% of nodes can be reached in < 8 hops

   Hops        Nodes
      1           10
      2           78
      3          396
      4         8648
      5      3299252
      6     28395849
      7     79059497
      8     52995778
      9     10321008
     10      1955007
     11       518410
     12       149945
     13        44616
     14        13740
     15         4476
     16         1542
     17          536
     18          167
     19           71
     20           29
     21           16
     22           10
     23            3
     24            2
     25            3
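
A hops/nodes table of this kind counts how many nodes are first reached at each distance from a starting node. The slides do not say how the table was computed; one standard way to obtain such counts is a breadth-first search, sketched below on a tiny made-up graph. The function name `hop_histogram` and the graph itself are hypothetical.

```python
from collections import Counter, deque


def hop_histogram(graph, source):
    """Return {hops: number of nodes first reached at that distance}."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                # First visit in BFS order gives the shortest hop count.
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    # The source itself (0 hops) is left out, matching a table that starts at 1.
    return Counter(d for d in distance.values() if d > 0)


# Toy undirected graph given as adjacency lists.
toy_graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "d"],
    "d": ["b", "c", "e"],
    "e": ["d"],
}
histogram = hop_histogram(toy_graph, "a")
for hops in sorted(histogram):
    print(hops, histogram[hops])
```

On a network with hundreds of millions of nodes the same traversal would need a distributed or out-of-core implementation; the sketch only shows the counting logic.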
   Big-Data is everywhere, we are just not used to
    dealing with it

   The “Big-Data” hype is very recent
    ◦ …growth seems to be going up
    ◦ …evident lack of experts to build Big-Data apps

   Can we do “Big-Data” without big investment?
    ◦ …yes – many open source tools, computing machinery is
      cheap (to buy or to rent)
    ◦ …the key is knowledge on how to deal with data
    ◦ …data is either free (e.g. Wikipedia) or available for purchase (e.g.
      Twitter)

More Related Content

What's hot

Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013boorad
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data StackZubair Nabi
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big datahktripathy
 
BigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTBigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTAmrit Chhetri
 
The Evolution of Data Science
The Evolution of Data ScienceThe Evolution of Data Science
The Evolution of Data ScienceKenny Daniel
 
Mongo Internal Training session by Soner Altin
Mongo Internal Training session by Soner AltinMongo Internal Training session by Soner Altin
Mongo Internal Training session by Soner Altinmustafa sarac
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataVipin Batra
 
Introduction of big data unit 1
Introduction of big data unit 1Introduction of big data unit 1
Introduction of big data unit 1RojaT4
 
Intro to Big Data Hadoop
Intro to Big Data HadoopIntro to Big Data Hadoop
Intro to Big Data HadoopApache Apex
 
Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)SahilRaina21
 
Big Data Analytics(concepts of hadoop mapreduce,mahout,k-means clustering,hbase)
Big Data Analytics(concepts of hadoop mapreduce,mahout,k-means clustering,hbase)Big Data Analytics(concepts of hadoop mapreduce,mahout,k-means clustering,hbase)
Big Data Analytics(concepts of hadoop mapreduce,mahout,k-means clustering,hbase)MIT College Of Engineering,Pune
 
Data Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research OpportunitiesData Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research OpportunitiesKathirvel Ayyaswamy
 
Big data technology unit 3
Big data technology unit 3Big data technology unit 3
Big data technology unit 3RojaT4
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
Big Data Unit 4 - Hadoop
Big Data Unit 4 - HadoopBig Data Unit 4 - Hadoop
Big Data Unit 4 - HadoopRojaT4
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsKamalika Dutta
 
Intro to Data Science Big Data
Intro to Data Science Big DataIntro to Data Science Big Data
Intro to Data Science Big DataIndu Khemchandani
 

What's hot (20)

Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013
 
The Big Data Stack
The Big Data StackThe Big Data Stack
The Big Data Stack
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big data
 
BigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRTBigData Analytics with Hadoop and BIRT
BigData Analytics with Hadoop and BIRT
 
The Evolution of Data Science
The Evolution of Data ScienceThe Evolution of Data Science
The Evolution of Data Science
 
Mongo Internal Training session by Soner Altin
Mongo Internal Training session by Soner AltinMongo Internal Training session by Soner Altin
Mongo Internal Training session by Soner Altin
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Introduction of big data unit 1
Introduction of big data unit 1Introduction of big data unit 1
Introduction of big data unit 1
 
Intro to Big Data Hadoop
Intro to Big Data HadoopIntro to Big Data Hadoop
Intro to Big Data Hadoop
 
Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)Intro to bigdata on gcp (1)
Intro to bigdata on gcp (1)
 
Big Data Analytics(concepts of hadoop mapreduce,mahout,k-means clustering,hbase)
Big Data Analytics(concepts of hadoop mapreduce,mahout,k-means clustering,hbase)Big Data Analytics(concepts of hadoop mapreduce,mahout,k-means clustering,hbase)
Big Data Analytics(concepts of hadoop mapreduce,mahout,k-means clustering,hbase)
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Data Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research OpportunitiesData Mining and Big Data Challenges and Research Opportunities
Data Mining and Big Data Challenges and Research Opportunities
 
Big data technology unit 3
Big data technology unit 3Big data technology unit 3
Big data technology unit 3
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big Data Unit 4 - Hadoop
Big Data Unit 4 - HadoopBig Data Unit 4 - Hadoop
Big Data Unit 4 - Hadoop
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time Systems
 
Data lake ppt
Data lake pptData lake ppt
Data lake ppt
 
Big Data: an introduction
Big Data: an introductionBig Data: an introduction
Big Data: an introduction
 
Intro to Data Science Big Data
Intro to Data Science Big DataIntro to Data Science Big Data
Intro to Data Science Big Data
 

Viewers also liked

Introduction to mining massive datasets
Introduction to mining massive datasetsIntroduction to mining massive datasets
Introduction to mining massive datasetsViet-Trung TRAN
 
Python for Data Anaysis第2回勉強会4,5章
Python for Data Anaysis第2回勉強会4,5章Python for Data Anaysis第2回勉強会4,5章
Python for Data Anaysis第2回勉強会4,5章Makoto Kawano
 
Big Data Analytics with R
Big Data Analytics with RBig Data Analytics with R
Big Data Analytics with RGreat Wide Open
 
Data Analytics using R
Data Analytics using RData Analytics using R
Data Analytics using Rrichards9696
 
Intoroduction of Pandas with Python
Intoroduction of Pandas with PythonIntoroduction of Pandas with Python
Intoroduction of Pandas with PythonAtsushi Hayakawa
 
DIGITAL MARKETING FOR BUSINESS GROWTH Workshop Slides
DIGITAL MARKETING FOR BUSINESS GROWTH Workshop SlidesDIGITAL MARKETING FOR BUSINESS GROWTH Workshop Slides
DIGITAL MARKETING FOR BUSINESS GROWTH Workshop SlidesBjarne Viken
 
A Workshop on R
A Workshop on RA Workshop on R
A Workshop on RAjay Ohri
 
Big Data: tools and techniques for working with large data sets
Big Data: tools and techniques for working with large data setsBig Data: tools and techniques for working with large data sets
Big Data: tools and techniques for working with large data setsBoston Consulting Group
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data scienceAjay Ohri
 
Scalable Learning Technologies for Big Data Mining
Scalable Learning Technologies for Big Data MiningScalable Learning Technologies for Big Data Mining
Scalable Learning Technologies for Big Data MiningGerard de Melo
 
Statistics for data scientists
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientistsAjay Ohri
 
Python for R Users
Python for R UsersPython for R Users
Python for R UsersAjay Ohri
 
Poverty Hedging_Catalysts For Change Zone of Future Innovtion
Poverty Hedging_Catalysts For Change Zone of Future InnovtionPoverty Hedging_Catalysts For Change Zone of Future Innovtion
Poverty Hedging_Catalysts For Change Zone of Future InnovtionInstitute for the Future
 
Adaptive Shelters_Catalysts For Change Zone of Future Innovtion
Adaptive Shelters_Catalysts For Change Zone of Future InnovtionAdaptive Shelters_Catalysts For Change Zone of Future Innovtion
Adaptive Shelters_Catalysts For Change Zone of Future InnovtionInstitute for the Future
 
All Of The Above
All Of The AboveAll Of The Above
All Of The Abovekmurray230
 

Viewers also liked (20)

Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
Introduction to mining massive datasets
Introduction to mining massive datasetsIntroduction to mining massive datasets
Introduction to mining massive datasets
 
Introduction to hadoop
Introduction to hadoopIntroduction to hadoop
Introduction to hadoop
 
Python for Data Anaysis第2回勉強会4,5章
Python for Data Anaysis第2回勉強会4,5章Python for Data Anaysis第2回勉強会4,5章
Python for Data Anaysis第2回勉強会4,5章
 
Big Data Analytics with R
Big Data Analytics with RBig Data Analytics with R
Big Data Analytics with R
 
Big data analytics using R
Big data analytics using RBig data analytics using R
Big data analytics using R
 
Data Analytics using R
Data Analytics using RData Analytics using R
Data Analytics using R
 
Intoroduction of Pandas with Python
Intoroduction of Pandas with PythonIntoroduction of Pandas with Python
Intoroduction of Pandas with Python
 
DIGITAL MARKETING FOR BUSINESS GROWTH Workshop Slides
DIGITAL MARKETING FOR BUSINESS GROWTH Workshop SlidesDIGITAL MARKETING FOR BUSINESS GROWTH Workshop Slides
DIGITAL MARKETING FOR BUSINESS GROWTH Workshop Slides
 
RHadoop
RHadoopRHadoop
RHadoop
 
A Workshop on R
A Workshop on RA Workshop on R
A Workshop on R
 
Big Data: tools and techniques for working with large data sets
Big Data: tools and techniques for working with large data setsBig Data: tools and techniques for working with large data sets
Big Data: tools and techniques for working with large data sets
 
Tools and techniques for data science
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
 
Scalable Learning Technologies for Big Data Mining
Scalable Learning Technologies for Big Data MiningScalable Learning Technologies for Big Data Mining
Scalable Learning Technologies for Big Data Mining
 
Statistics for data scientists
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientists
 
Python for R Users
Python for R UsersPython for R Users
Python for R Users
 
Poverty Hedging_Catalysts For Change Zone of Future Innovtion
Poverty Hedging_Catalysts For Change Zone of Future InnovtionPoverty Hedging_Catalysts For Change Zone of Future Innovtion
Poverty Hedging_Catalysts For Change Zone of Future Innovtion
 
3d Views Portfolio
3d Views Portfolio3d Views Portfolio
3d Views Portfolio
 
Adaptive Shelters_Catalysts For Change Zone of Future Innovtion
Adaptive Shelters_Catalysts For Change Zone of Future InnovtionAdaptive Shelters_Catalysts For Change Zone of Future Innovtion
Adaptive Shelters_Catalysts For Change Zone of Future Innovtion
 
All Of The Above
All Of The AboveAll Of The Above
All Of The Above
 

Similar to Big Data Tutorial - Marko Grobelnik - 25 May 2012

EDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEuropean Data Forum
 
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data TutorialESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorialeswcsummerschool
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4heyramzz
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4GV prasad
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4Pragati Singh
 
The information supernova
The information supernovaThe information supernova
The information supernovaAlaa Al-Agamawi
 
Big data-analytics-changing-way-organizations-conducting-business
Big data-analytics-changing-way-organizations-conducting-businessBig data-analytics-changing-way-organizations-conducting-business
Big data-analytics-changing-way-organizations-conducting-businessAmit Bhargava
 
Big data in marketing at harvard business club nick1 june 15 2013
Big data in marketing at harvard business club nick1 june 15 2013Big data in marketing at harvard business club nick1 june 15 2013
Big data in marketing at harvard business club nick1 june 15 2013nkabra
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2RojaT4
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Rahul Jain
 
The AI Platform Business Revolution: Matchmaking, Empathetic Technology, and ...
The AI Platform Business Revolution: Matchmaking, Empathetic Technology, and ...The AI Platform Business Revolution: Matchmaking, Empathetic Technology, and ...
The AI Platform Business Revolution: Matchmaking, Empathetic Technology, and ...Steve Omohundro
 
Big Data = Big Decisions
Big Data = Big DecisionsBig Data = Big Decisions
Big Data = Big DecisionsInnoTech
 
Big Data - Umesh Bellur
Big Data - Umesh BellurBig Data - Umesh Bellur
Big Data - Umesh BellurSTS FORUM 2016
 

Similar to Big Data Tutorial - Marko Grobelnik - 25 May 2012 (20)

EDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko Grobelnik
 
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data TutorialESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
ESWC SS 2012 - Friday Keynote Marko Grobelnik: Big Data Tutorial
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
 
Big data tutorial_part4
Big data tutorial_part4Big data tutorial_part4
Big data tutorial_part4
 
Big data tutorial
Big data tutorialBig data tutorial
Big data tutorial
 
The information supernova
The information supernovaThe information supernova
The information supernova
 
Big data-analytics-changing-way-organizations-conducting-business
Big data-analytics-changing-way-organizations-conducting-businessBig data-analytics-changing-way-organizations-conducting-business
Big data-analytics-changing-way-organizations-conducting-business
 
Big data business case
Big data   business caseBig data   business case
Big data business case
 
Big data in marketing at harvard business club nick1 june 15 2013
Big data in marketing at harvard business club nick1 june 15 2013Big data in marketing at harvard business club nick1 june 15 2013
Big data in marketing at harvard business club nick1 june 15 2013
 
Big data unit 2
Big data unit 2Big data unit 2
Big data unit 2
 
Dma unit 1
Dma unit   1Dma unit   1
Dma unit 1
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
Data mining
Data miningData mining
Data mining
 
The AI Platform Business Revolution: Matchmaking, Empathetic Technology, and ...
The AI Platform Business Revolution: Matchmaking, Empathetic Technology, and ...The AI Platform Business Revolution: Matchmaking, Empathetic Technology, and ...
The AI Platform Business Revolution: Matchmaking, Empathetic Technology, and ...
 
Internet of Things
Internet of ThingsInternet of Things
Internet of Things
 
Big Data on AWS
Big Data on AWSBig Data on AWS
Big Data on AWS
 
Big Data = Big Decisions
Big Data = Big DecisionsBig Data = Big Decisions
Big Data = Big Decisions
 
Big Data - Umesh Bellur
Big Data - Umesh BellurBig Data - Umesh Bellur
Big Data - Umesh Bellur
 



Big Data Tutorial - Marko Grobelnik - 25 May 2012

  • 1. Marko Grobelnik marko.grobelnik@ijs.si Jozef Stefan Institute Kalamaki, May 25th 2012
  • 2. Introduction
      ◦ What is Big data?
      ◦ Why Big-Data?
      ◦ When is Big-Data really a problem?
    Techniques
    Tools
    Applications
    Literature
  • 5. ‘Big-data’ is similar to ‘Small-data’, but bigger
    …but having bigger data consequently requires different approaches:
      ◦ techniques, tools, architectures
    …with an aim to solve new problems
      ◦ …and old problems in a better way.
  • 6. From “Understanding Big Data” by IBM
  • 9. Key enablers for the growth of “Big Data” are:
      ◦ Increase of storage capacities
      ◦ Increase of processing power
      ◦ Availability of data
  • 20. Where is processing hosted?
      ◦ Distributed Servers / Cloud (e.g. Amazon EC2)
    Where is data stored?
      ◦ Distributed Storage (e.g. Amazon S3)
    What is the programming model?
      ◦ Distributed Processing (e.g. MapReduce)
    How is data stored & indexed?
      ◦ High-performance schema-free databases (e.g. MongoDB)
    What operations are performed on data?
      ◦ Analytic / Semantic Processing (e.g. R, OWLIM)
  • 21. Computing and storage are typically hosted transparently on cloud infrastructures
      ◦ …providing scale, flexibility and high fail-safety
    Distributed Servers
      ◦ Amazon EC2, Google App Engine, Elastic Beanstalk, Heroku
    Distributed Storage
      ◦ Amazon S3, Hadoop Distributed File System
  • 22. Distributed processing of Big-Data requires non-standard programming models
      ◦ …beyond single machines or traditional parallel programming models (like MPI)
      ◦ …the aim is to simplify complex programming tasks
    The most popular programming model is the MapReduce approach
    Implementations of MapReduce and related tools
      ◦ Hadoop (http://hadoop.apache.org/), Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban, Oozie, Greenplum
  • 23. The key idea of the MapReduce approach:
      ◦ A target problem needs to be parallelizable
      ◦ First, the problem gets split into a set of smaller problems (Map step)
      ◦ Next, the smaller problems are solved in parallel
      ◦ Finally, the set of solutions to the smaller problems gets synthesized into a solution of the original problem (Reduce step)
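The steps above can be illustrated with a word count, the usual introductory MapReduce example. This is a minimal single-process sketch in plain Python, not the Hadoop API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the grouping step that a real framework performs between Map and Reduce is simulated with a dictionary.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between the two steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all partial counts for one word."""
    return key, sum(values)


def word_count(documents):
    # Each map_phase call depends only on its own document,
    # so on a cluster these calls could run on different machines.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["big data is big", "small data is small"])
print(counts)
```

The point of the split is that neither `map_phase` nor `reduce_phase` shares state with other calls, which is what makes the problem parallelizable in the sense of the first bullet.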
  • 24. Databases in the NoSQL class have in common that they:
      ◦ Support large amounts of data
      ◦ Have a mostly non-SQL interface
      ◦ Operate on distributed infrastructures (e.g. Hadoop)
      ◦ Are based on key-value pairs (no predefined schema)
      ◦ …are flexible and fast
    Implementations
      ◦ MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, ZooKeeper…
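To make "key-value pairs with no predefined schema" concrete, here is a toy in-memory sketch. It does not reflect the API of any of the listed systems: the class `ShardedKeyValueStore` and its methods are hypothetical, and the shards are plain dictionaries standing in for separate machines.

```python
import hashlib


class ShardedKeyValueStore:
    """Toy key-value store: keys are hashed onto shards, values are free-form documents."""

    def __init__(self, num_shards=4):
        # Each dict stands in for one node of a distributed deployment (an assumption).
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        # A stable hash makes the same key always land on the same shard.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def put(self, key, document):
        self._shard_for(key)[key] = document

    def get(self, key, default=None):
        return self._shard_for(key).get(key, default)


store = ShardedKeyValueStore()
# The two documents have different fields; no schema is declared anywhere.
store.put("user:1", {"name": "Ana", "country": "SI"})
store.put("page:7", {"url": "/news", "tags": ["big-data"], "hits": 120})
print(store.get("user:1"), store.get("page:7"))
```

Lookups by key touch a single shard, which is one reason such stores scale out easily; the price is that queries across keys need extra machinery.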
  • 26. …when the operations on data are complex:
      ◦ …e.g. simple counting is not a complex problem
      ◦ Modeling and reasoning with data of different kinds can get extremely complex
    Good news about big-data:
      ◦ Often, because of the vast amount of data, modeling techniques can get simpler (e.g. smart counting can replace complex model-based analytics)…
      ◦ …as long as we deal with the scale
  • 27. Research areas (such as IR, KDD, ML, NLP, SemWeb, …) are sub-cubes within the data cube
    [Figure: data cube with labels Usage, Quality, Context, Streaming, Scalability]
  • 28. A risk with “Big-Data mining” is that an analyst can “discover” patterns that are meaningless
    Statisticians call it Bonferroni’s principle:
      ◦ Roughly, if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap
    Example taken from: Rajaraman, Ullman: Mining of Massive Datasets
  • 29. Example:
    We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day
      ◦ 10^9 people being tracked
      ◦ 1000 days
      ◦ Each person stays in a hotel 1% of the time (1 day out of 100)
      ◦ Hotels hold 100 people (so 10^5 hotels)
      ◦ If everyone behaves randomly (i.e., no terrorists), will the data mining detect anything suspicious?
    Expected number of “suspicious” pairs of people:
      ◦ 250,000
      ◦ …too many combinations to check: we need some additional evidence to find “suspicious” pairs of people in a more efficient way
    Example taken from: Rajaraman, Ullman: Mining of Massive Datasets
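The figure of 250,000 can be reproduced with a few lines of arithmetic. The sketch below assumes, as the example does, that everyone behaves randomly and independently, with 10^9 people, 1000 days, 10^5 hotels and a 1% chance of a hotel stay on any day; the function name and its parameters are hypothetical. It computes both the rounded estimate, which approximates the number of pairs among n items as n²/2, and the exact binomial count.

```python
from fractions import Fraction
from math import comb


def expected_suspicious_pairs(people, days, hotels, p_stay, exact=False):
    """Expected number of pairs of people who share a hotel on two different days."""
    # Both people are in some hotel that day, and it happens to be the same hotel.
    p_same_hotel_one_day = p_stay * p_stay / hotels
    # The same coincidence on two different days, assumed independent.
    p_two_days = p_same_hotel_one_day ** 2
    if exact:
        pairs_of_people = comb(people, 2)
        pairs_of_days = comb(days, 2)
    else:
        # Approximation for large n: n choose 2 is about n^2 / 2.
        pairs_of_people = Fraction(people ** 2, 2)
        pairs_of_days = Fraction(days ** 2, 2)
    return pairs_of_people * pairs_of_days * p_two_days


approx = expected_suspicious_pairs(10**9, 1000, 10**5, Fraction(1, 100))
exact = expected_suspicious_pairs(10**9, 1000, 10**5, Fraction(1, 100), exact=True)
print(float(approx), float(exact))
```

Both values land at roughly a quarter of a million pairs produced purely by chance, which is why any real signal would be buried without additional evidence.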
  • 30. Smart sampling of data
      ◦ …reducing the original data while not losing the statistical properties of data
    Finding similar items
      ◦ …efficient multidimensional indexing
    Incremental updating of the models
      ◦ (vs. building models from scratch)
      ◦ …crucial for streaming data
    Distributed linear algebra
      ◦ …dealing with large sparse matrices
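As one illustration of sampling that also works incrementally on a stream, here is a sketch of reservoir sampling (Algorithm R). It is an example chosen here, not a technique named on the slide: it keeps a uniform random sample of fixed size k from a stream of unknown length using O(k) memory. The function name and the `rng` parameter are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = rng or random.Random()
    reservoir = []
    for index, item in enumerate(stream):
        if index < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number index+1 replaces a random slot with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                reservoir[slot] = item
    return reservoir


sample = reservoir_sample(range(100_000), k=5, rng=random.Random(42))
print(sample)
```

Each item is seen once and then either kept or discarded, so the sample stays valid at every point in the stream without revisiting earlier data.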
  • 31. On top of the previous operations we perform the usual data mining/machine learning/statistics operators:
      ◦ Supervised learning (classification, regression, …)
      ◦ Unsupervised learning (clustering, different types of decompositions, …)
      ◦ …
    …we are just more careful about which algorithms we choose (typically linear or sub-linear versions)
  • 32. An excellent overview of the algorithms covering the above issues is the book “Rajaraman, Ullman: Mining of Massive Datasets”
  • 34. Good recommendations can make a big difference in keeping a user on a web site
      ◦ …the key is how rich a context model the system uses to select information for a user
      ◦ Bad recommendations get clicks from <1% of users, good ones from >5% of users
      ◦ 200 clicks/sec; contextual personalized recommendations are generated in ~20 ms
  • 35. Domain; Referring Domain; Zip Code; Sub-domain; Referring URL; State; Page URL; Outgoing URL; Income; URL sub-directories; Age; GeoIP Country; Gender; Page Meta Tags; GeoIP State; Country; Page Title; GeoIP City; Job Title; Page Content; Job Industry; Named Entities; Absolute Date; Day of the Week; Has Query; Day period; Referrer Query; Hour of the day; User Agent
  • 36. [Diagram: Trend Detection System. Labels: Log Files (~100M page clicks per day), Stream of clicks, User profiles, Stream of profiles, NYT articles, Segments, Trends and updated segments, Sales, Campaign to sell segments ($), Advertisers]
    Segment        Keywords
    Stock Market   Stock Market, mortgage, banking, investors, Wall Street, turmoil, New York Stock Exchange
    Health         diabetes, heart disease, disease, heart, illness
    Green Energy   Hybrid cars, energy, power, model, carbonated, fuel, bulbs
    Hybrid cars    Hybrid cars, vehicles, model, engines, diesel
    Travel         travel, wine, opening, tickets, hotel, sites, cars, search, restaurant
    …              …
  • 37. 50 GB of uncompressed log files
    50-100M clicks
    4-6M unique users
    7000 unique pages with more than 100 hits
  • 38. [Diagram: Telecom Network (~25 000 devices) sends alarms (~10-100/sec) to the Alarms Server, which passes a live feed of data to the Alarms Explorer Server; outputs go to the Operator and a big board display]
    The Alarms Explorer Server implements three real-time scenarios on the alarms stream:
      1. Root-Cause-Analysis: finding which device is responsible for an occasional “flood” of alarms
      2. Short-Term Fault Prediction: predict which device will fail in the next 15 minutes
      3. Long-Term Anomaly Detection: detect unusual trends in the network
    …the system is used in British Telecom
  • 39. The aim of the project is to collect and analyze most of the main-stream media across the world
      ◦ …from 35,000 publishers (180K RSS feeds) crawled in real time (roughly 10 articles per second)
      ◦ …each article document gets extracted, cleaned, semantically annotated, structure extracted
    Challenges are in terms of complexity of processing and querying of the extracted data
      ◦ …and matching textual information across the languages (cross-lingual technologies)
    http://render-project.eu/ http://www.xlike.org/
  • 41. [Figure: Plain text → Text Enrichment → Extracted graph of triples from text]
    “Enrycher” is available as a web-service generating Semantic Graph, LOD links, Entities, Keywords, Categories, Text Summarization
  • 42. The aim is to use analytic techniques to visualize documents in different ways:
      ◦ Topic view
      ◦ Social view
      ◦ Temporal view
  • 45. [Figure: Topic Trends Visualization. Panels: Query, Result set, Topics description. Topic labels: US Elections, US Budget, NATO-Russia, Mid-East conflict]
  • 49. [Figure: search interface with Query, Conceptual map and Search Point]
    Dynamic contextual ranking based on the search point
  • 50. Observe social and communication phenomena at a planetary scale
    Largest social network analyzed up to 2010
    Research questions:
      ◦ How does communication change with user demographics (age, sex, language, country)?
      ◦ How does geography affect communication?
      ◦ What is the structure of the communication network?
    “Planetary-Scale Views on a Large Instant-Messaging Network” Leskovec & Horvitz WWW2008
  • 51. We collected the data for June 2006
    Log size: 150 GB/day (compressed)
    Total: 1 month of communication data: 4.5 TB of compressed data
    Activity over June 2006 (30 days)
      ◦ 245 million users logged in
      ◦ 180 million users engaged in conversations
      ◦ 17.5 million new accounts activated
      ◦ More than 30 billion conversations
      ◦ More than 255 billion exchanged messages
  • 54. Count the number of users logging in from a particular location on the earth
  • 55. Logins from Europe
  • 56. 6 degrees of separation [Milgram ’60s]
    Average distance between two random users is 6.622
    90% of nodes can be reached within 8 hops
    Hops   Nodes
     1     10
     2     78
     3     396
     4     8648
     5     3299252
     6     28395849
     7     79059497
     8     52995778
     9     10321008
    10     1955007
    11     518410
    12     149945
    13     44616
    14     13740
    15     4476
    16     1542
    17     536
    18     167
    19     71
    20     29
    21     16
    22     10
    23     3
    24     2
    25     3
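The 90% claim can be checked against the hops table with a cumulative sum. The sketch below reads the table as a histogram of how many nodes are reached at each hop count; the row for 22 hops (10 nodes) is inferred from the sequence, since its label is missing in the transcript. Variable names are hypothetical.

```python
from itertools import accumulate

# Hops -> number of nodes reached at exactly that distance (from the table above).
hops_to_nodes = {
    1: 10, 2: 78, 3: 396, 4: 8648, 5: 3299252, 6: 28395849, 7: 79059497,
    8: 52995778, 9: 10321008, 10: 1955007, 11: 518410, 12: 149945,
    13: 44616, 14: 13740, 15: 4476, 16: 1542, 17: 536, 18: 167, 19: 71,
    20: 29, 21: 16, 22: 10,  # the 22-hop row is an inferred reading
    23: 3, 24: 2, 25: 3,
}

total = sum(hops_to_nodes.values())
# Fraction of all reached nodes that lie within h hops, for each h.
coverage = {
    hops: reached / total
    for hops, reached in zip(hops_to_nodes, accumulate(hops_to_nodes.values()))
}
# Smallest hop count that covers at least 90% of the nodes.
effective_hops = min(hops for hops, fraction in coverage.items() if fraction >= 0.9)

print(effective_hops, round(coverage[7], 3), round(coverage[8], 3))
```

Under this reading, about 63% of nodes lie within 7 hops and about 93% within 8, so the 90% mark is crossed at the eighth hop.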
  • 58. Big-Data is everywhere, we are just not used to dealing with it
    The “Big-Data” hype is very recent
      ◦ …growth seems to be going up
      ◦ …evident lack of experts to build Big-Data apps
    Can we do “Big-Data” without big investment?
      ◦ …yes: there are many open source tools, and computing machinery is cheap (to buy or to rent)
      ◦ …the key is knowledge of how to deal with data
      ◦ …data is either free (e.g. Wikipedia) or available to buy (e.g. Twitter)