Technological innovation,
    the web and statistics

data, big data, open data
      Vincenzo Patruno

       Rome, 29 January 2013
A world of data
Obama’s Election
Victory
Creating a “single source of truth”


Combining disparate data sources of potential donors, volunteers and voters
(email, postal, telephone, mobile and social contacts with historical voting
records, polling and fundraising data)




They built a single view of individuals that informed
their strategies for raising funds, mobilizing
volunteers and securing votes.



Source: http://connect.icrossing.co.uk/obamas-big-data-election-victory_9423
Profiling and predicting


  Demographics and data collected by fieldwork on the campaign trail were
  added to the mix, allowing predictive modelling to score people on their
  likelihood to donate or vote for the Democrats.




  Channels of communication were optimized, and the
  type of messaging was tailored to maximize the
  likelihood of response.




Source: http://connect.icrossing.co.uk/obamas-big-data-election-victory_9423
Turning data into the human touch


The power of localised networks and neighbourhoods
Using centralized data to provide geo-targeted insight, campaign
volunteers could base themselves in the areas that mattered most, talking
to the voters they had got to know since the start of the 2008 campaign.


Deliver their message from within communities

The impact of this saw them receive double the votes they achieved in
2008 in the marginal states.


Source: http://connect.icrossing.co.uk/obamas-big-data-election-victory_9423
Turning data into the human touch



More than two million small donors contributed
over 427 million dollars to his campaign.

About 55% of the funds raised came from donations
under 200 dollars.




Source: http://connect.icrossing.co.uk/obamas-big-data-election-victory_9423
Focus on the swing states

Regular polling of states like Ohio throughout the
campaign provided valuable data for the team to process
and analyze trends.
For example, the analysts could track the impact of the three TV debates on
the Democratic vote in real time and were able to identify specific segments to
target with campaign material, split by region, demographics and the profile
scoring that had been modelled in the new database. One Democrat official
commented that they scenario-tested the election 66,000 times every night in
order to calculate predicted outcomes for swing states.
Campaign resources were then allocated to persuade the undecided
voters most likely to pledge their allegiance to Obama.

By the time election day came around, the Democrats had
a clear idea of how voting in the swing states was looking.
Source: http://connect.icrossing.co.uk/obamas-big-data-election-victory_9423
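The nightly "scenario testing" described above is, in spirit, a Monte Carlo simulation. The following is a minimal sketch of that idea, not the campaign's actual model: the three states, their win probabilities and the count of already-secured votes are hypothetical placeholders, and the states are assumed to be independent of each other, which a real model would not assume.

```python
import random

# Hypothetical inputs: (electoral votes, estimated win probability) per swing state.
SWING_STATES = {
    "State A": (18, 0.55),
    "State B": (29, 0.48),
    "State C": (13, 0.52),
}
SAFE_VOTES = 240     # hypothetical number of electoral votes assumed already secured
VOTES_TO_WIN = 270   # majority of the 538 electoral votes


def simulate_once(rng):
    """Play out one election scenario and return the total electoral votes won."""
    total = SAFE_VOTES
    for votes, p_win in SWING_STATES.values():
        if rng.random() < p_win:
            total += votes
    return total


def win_probability(n_runs=66_000, seed=0):
    """Estimate the chance of winning as the share of simulated scenarios won."""
    rng = random.Random(seed)
    wins = sum(simulate_once(rng) >= VOTES_TO_WIN for _ in range(n_runs))
    return wins / n_runs


if __name__ == "__main__":
    print(f"Estimated win probability: {win_probability():.3f}")
```

Re-running the simulation each night with updated polling inputs would show how the predicted outcome shifts over time.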
Data science involvement in the election wasn’t
just restricted to the candidates’ teams.

Nate Silver used sabermetrics to accurately predict the outcome of
all 50 state votes




Source: http://connect.icrossing.co.uk/obamas-big-data-election-victory_9423
Big Data – What Is It?

Volume. Variety. Velocity.

  Taken together, these three “Vs” of Big Data were
  originally posited by Gartner’s Doug Laney in a 2001 research report.

Variability. Complexity.
“It’s difficult to imagine the
power that you’re going to have
when so many different sorts of
data are available”

Tim Berners-Lee
Data never sleeps



Facebook World
Source: http://ipcarrier.blogspot.it/2010/12/facebook-world.html
http://youtu.be/xJXOavGwAW8
The Data Deluge
Mass Opinion Business Intelligence (MOBI) analyzes and
classifies comments made online and distills the information into a
pre-defined, structured database.



MOBI methodology combines online measurement, cloud
computing and market research to provide live consumer
sentiment data on brands, products and purchase-influencing
factors, using decision-support information drawn from millions of
unsolicited opinions.




http://en.wikipedia.org/wiki/WiseWindow
Financial Services Industry: Bloomberg and
WiseWindow use social media and big data to improve
investment returns.
http://en.wikipedia.org/wiki/WiseWindow
Natural disasters: Twitter was a richer and more up-to-date
source of information about the 5.8 magnitude
quake in Virginia.
http://youtu.be/PThAriHjk10

Twitter traffic after the Japan earthquake
Automotive Industry: Big data analysis of social media
comments can predict trends in automotive equipment
failures.
Telecommunications: T-Mobile used big data integrated
with its transaction systems and social media to
dramatically cut customer defections in one quarter.
Energy/Utility Industry: GE is going to use social media
reports to track outages faster and better.
Advertising Industry: Dachis Group used big data
analysis of social media to create a more up-to-date and
accurate ranking of the competitive position of
engagement at large companies.
Marketing: Nestle is using social media listening and
analytics to engage at scale in the market using its big
data powered central command center.
Education Industry: DoSomething.org engaged 200,000
people worldwide on Facebook to combat bullying in
schools and analyzed their sentiments.
Criminal Justice: Police departments around the United
States now use social media analysis extensively to
fight crime.
Health Care Industry: Using social media and big data to
track cholera outbreaks in Haiti faster and more
accurately.
API




Application
Programming
Interface
API
http://apistat.istat.it/?q=gettable&dataset=DCIS_POPORESBIL&dim=82,0,0,0&lang=0&tr=&te=




                         query string
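To make the "query string" label concrete, here is a minimal sketch that splits the ISTAT URL above into its parameters using Python's standard library. Only the URL comes from the slide; the helper name split_query is an assumption made for illustration, and nothing here contacts the server or interprets what each parameter means to the API.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode

URL = ("http://apistat.istat.it/?q=gettable&dataset=DCIS_POPORESBIL"
       "&dim=82,0,0,0&lang=0&tr=&te=")


def split_query(url):
    """Return the query-string parameters of a URL as a dict (illustrative helper)."""
    query = urlsplit(url).query
    # keep_blank_values=True preserves empty parameters such as "tr=" and "te=".
    return dict(parse_qsl(query, keep_blank_values=True))


params = split_query(URL)
for name, value in params.items():
    print(f"{name!r}: {value!r}")

# The same parameters can be re-encoded into a query string;
# safe="," keeps the commas in "dim" unescaped.
rebuilt_query = urlencode(params, safe=",")
```

Everything after the "?" is a list of name=value pairs separated by "&"; changing a value such as the dataset identifier changes what the API is asked for.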
API
http://developers.facebook.com/
https://dev.twitter.com/

             E.g.:
             https://stream.twitter.com/1.1/statuses/sample.json
http://cs.croakun.com/
[…]
     50%   pointless babble
     10%   spare time activities
      7%   work
      5%   TV and Radio
      3%   politics

Thanx Piet!
http://youtu.be/iReY3W9ZkLU
Top 5 Myths about Big Data
1. Big Data is Only About Massive Data Volume
     Generally speaking, experts consider petabytes of data as the starting point for
     Big Data, although this volume threshold is a moving target. So while volume is
     important, the next two “Vs” are better individual indicators.
     Variety refers to the many different data and file types that are important to manage and
     analyze more thoroughly, but for which traditional relational databases are poorly suited.
     Examples include sound and movie files, images, documents, geolocation data, web logs,
     and text strings.
     Velocity is about the rate of change in the data and how quickly it must be used to create
     real value. Traditional technologies are especially poorly suited to storing and using
     high-velocity data, so new approaches are needed. The more quickly the data in question is
     created and aggregated, and the more swiftly it must be used to uncover patterns and
     problems, the greater the velocity and the more likely it is that you have a Big Data
     opportunity.
2. Big Data Means Hadoop
   Hadoop is the Apache open-source software framework for working with Big Data. It was derived
   from Google technology and put into practice by Yahoo and others. But Big Data is too varied and
   complex for a one-size-fits-all solution. While Hadoop has surely captured the greatest name
   recognition, it is just one of three classes of technologies well suited to storing and managing Big
   Data. The other two classes are NoSQL and Massively Parallel Processing (MPP) data stores.
   (See myth number five below for more about NoSQL.) Examples of MPP data stores include
   EMC’s Greenplum, IBM’s Netezza, and HP’s Vertica.
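The "Google technology" behind Hadoop includes the MapReduce programming model. The following is a plain-Python sketch of that model on a word-count task; it runs in a single process and does not use Hadoop's API, and the function names and the two sample documents are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


docs = ["big data open data", "open data"]  # made-up sample input
counts = word_count(docs)
print(counts)
```

In a real cluster the map and reduce steps would run in parallel on many machines, each working on its own slice of the data; the logic of each step stays this simple.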
3. Big Data Means Unstructured Data
  Big Data is probably better termed “multi-structured” as it could include text strings,
  documents of all types, audio and video files, metadata, web pages, email
  messages, social media feeds, form data, and so on. The consistent trait of these
  varied data types is that the data schema isn’t known or defined when the data is
  captured and stored. Rather, a data model is often applied at the time the data is
  used.
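The idea that the data model is applied when the data is used, rather than when it is stored, can be sketched as follows. The raw records, their field names and the target (author, content) model are all made up for illustration.

```python
import json

# Raw records are stored as-is: no common schema is enforced at write time.
RAW_STORE = [
    '{"type": "tweet", "user": "alice", "text": "hello"}',
    '{"type": "email", "from": "bob@example.org", "subject": "hi", "body": "see you soon"}',
    '{"type": "tweet", "user": "carol", "text": "open data!", "geo": [41.9, 12.5]}',
]


def read_as_messages(raw_lines):
    """Apply a data model at read time: project each record onto (author, content)."""
    for line in raw_lines:
        record = json.loads(line)
        author = record.get("user") or record.get("from")
        content = record.get("text") or record.get("body")
        yield {"author": author, "content": content}


messages = list(read_as_messages(RAW_STORE))
for message in messages:
    print(message)
```

A different analysis could read the same raw store through a different model, for example keeping only the records that carry a "geo" field.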
4. Big Data is for Social Media Feeds and Sentiment Analysis
  Simply put, if your organization needs to broadly analyze web traffic, IT system logs,
  customer sentiment, or any other type of digital shadows being created in record
  volumes each day, Big Data offers a way to do this. Even though the early pioneers of
  Big Data have been the largest, web-based, social media companies -- Google, Yahoo,
  Facebook -- it was the volume, variety, and velocity of data generated by their services
  that required a radically new solution rather than the need to analyze social feeds or
  gauge audience sentiment.
5. NoSQL means No SQL
 NoSQL means “not only SQL”, because these types of data stores offer domain-specific access and
 query techniques in addition to SQL or SQL-like interfaces. Technologies in the NoSQL category
 include key-value stores, document-oriented databases, graph databases, big table structures, and
 caching data stores. The native access methods provide a rich, low-latency approach to stored data,
 typically through a proprietary interface. SQL access has the advantage of familiarity and
 compatibility with many existing tools, although usually at some cost in latency, because the
 query must be translated into the native “language” of the underlying system.
 For example, Cassandra, the popular open-source key-value store offered in commercial form by
 DataStax, includes not only native APIs for direct access to Cassandra data but also CQL (its
 SQL-like interface) as its emerging preferred access mechanism. It’s important to choose the
 NoSQL technology that fits both the business problem and the data type; the many categories of
 NoSQL technologies offer plenty of choice.
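The contrast between native key-based access and SQL access can be sketched with stand-ins from Python's standard library: a dict plays the key-value store and an in-memory SQLite database plays the SQL interface. This is neither Cassandra nor CQL, and the records are made up.

```python
import sqlite3

# Made-up records, keyed by user id.
users = {
    "u1": {"name": "Ada", "city": "Rome"},
    "u2": {"name": "Bob", "city": "Milan"},
}


def get_native(key):
    """Native access: a direct lookup by key, with no query to interpret."""
    return users[key]


# SQL access: the same data behind a declarative query interface.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(key, value["name"], value["city"]) for key, value in users.items()],
)


def get_sql(key):
    """SQL access: the engine must parse and plan the query before fetching the row."""
    row = conn.execute(
        "SELECT name, city FROM users WHERE id = ?", (key,)
    ).fetchone()
    return {"name": row[0], "city": row[1]}


# SQL also makes it easy to ask questions that are not keyed by id.
names_in_milan = [
    row[0] for row in conn.execute("SELECT name FROM users WHERE city = ?", ("Milan",))
]
```

Both paths return the same record; the difference lies in how much work happens between the request and the stored data, and in how flexible the questions can be.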
http://youtu.be/0eUeL3n7fDs
http://www.whitehouse.gov/sites/default/files/microsites/ostp/big_data_press_release_final_2.pdf
“Data scientist”
http://open.nasa.gov/blog/2012/10/04/what-is-nasa-doing-with-big-data-today/
Can Big Data be used to measure
economic, social and environmental phenomena?
Sample surveys

                Administrative archives
Price statistics
Significance magazine, August 2012
Big Data and City Living – what can it do for us?
Big Data Sources

  Sensors
  Transactional
  Administrative
  Behavioural
  Tracking Devices
Web Scraping

Examples:
  http://www.comune.torino.it/ambiente/aria/qualita_aria/dati_aria/valori_annuali_pm10.shtml
  https://scraperwiki.com/scrapers/valori_pm10_in_comune_di_torino/
Web Scraping

Examples:
  http://thebiobucket.blogspot.it/2011/10/little-webscraping-exercise.html#more

Milan, 13 December 2012
Web Scraping

Examples:
  http://www.metoffice.gov.uk/climate/uk/stationdata/armaghdata.txt
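As a minimal sketch of what a scraper does, the following extracts the rows of an HTML table into records using only Python's standard library. The HTML snippet, station names and values are invented for illustration; the actual structure of the pages linked above is not assumed, and no page is downloaded.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


# Made-up page fragment standing in for a published data table.
SAMPLE_HTML = """
<table>
  <tr><th>Station</th><th>Year</th><th>PM10</th></tr>
  <tr><td>Station X</td><td>2011</td><td>40</td></tr>
  <tr><td>Station Y</td><td>2011</td><td>35</td></tr>
</table>
"""


def scrape_table(html):
    """Turn the first row into field names and the remaining rows into dicts."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = scrape_table(SAMPLE_HTML)
for record in records:
    print(record)
```

A real scraper would first download the page and would need to cope with whatever markup the site actually uses, which tends to change without notice.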
http://elezionistorico.interno.it/
Open Data
  Open Data rests on the observation
  that public data was produced with
  public money, which belongs to
  the community. And it is to the
  community that the data must
  be returned.
Open Data

  Data freely accessible to everyone
  in an open format, without copyright,
  patent or other forms of control
  that restrict its use.
Open Government




A model of governance, at central and
local level, based on openness
(participation and collaboration) and on
transparency towards citizens.
The initiatives
Open Data
  Government Data
  Corporate Data
  Community Data
Community Data
Corporate Data
Data catalogues
Open Data formats

E.g. http://www.istat.it/it/files/2012/12/Tavole_XLS.zip
Data catalogues

Metadata:
  title
  description
  category
  territory
  source
  licence
  date
  url
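The metadata fields on this slide can be pictured as one record per dataset in a catalogue. The following is a minimal sketch under that assumption: the class name CatalogueEntry and every value in the example entry are hypothetical, and no particular catalogue standard is implied.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class CatalogueEntry:
    """One dataset description, using the metadata fields listed on the slide."""
    title: str
    description: str
    category: str
    territory: str
    source: str
    licence: str
    date: str
    url: str


# Hypothetical entry: every value below is a placeholder.
entry = CatalogueEntry(
    title="Example dataset",
    description="Placeholder description of what the dataset contains",
    category="Environment",
    territory="Example City",
    source="Example Agency",
    licence="CC BY",
    date="2013-01-29",
    url="http://example.org/data.csv",
)

# Serialised form, as a catalogue might expose it to other applications.
record_json = json.dumps(asdict(entry), indent=2)
print(record_json)
```

Keeping the same fields for every dataset is what makes a catalogue searchable by territory, category, source or licence.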
Volume      Sources

Relationships   Context
Data Integration
  Hospital admissions

Data Integration
  Building permits
  Causes of death
  Criminal records registry
  Hospital admissions
  Municipal resolutions
  Industries by ATECO code
  Environmental data
  Health expenditure
  Regional measures
  Maps
  Politicians' declarations

Data Integration
  Building permits
  Causes of death
  Criminal records registry
  Hospital admissions
  Municipal resolutions
  Industries by ATECO code
  Environmental data
  Health expenditure
  Regional measures
  Geographic data
  Politicians' declarations
RDF
LOD Cloud
Linked Open Data




Semantic Web
Thank you for your attention!

    @vincpatruno


      vincenzo.patruno@istat.it



  http://www.vincenzopatruno.org
