Visualization Lifecycle

datainsight San Francisco 2011
Raffael Marty
“Transform a dataset into a captive story.”



              ‣ Assess                        Youʼre on your own              Art
              ‣ Parse

              ‣ Clean

              ‣ Visualize



                                          Visualization Tools and Libraries

pixlcloud | collect. visualize. understand.                                         Copyright (c) 2011
Audience
[Figure: audience diagram with the labels Expert / Beginner, Technical / Overview, Fun / Boring]

Visualization Process
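[Figure: visualization process. Stages: Data Sources -> (Data Store: files, database) -> Structured Data -> Visual Representation. Step labels: parsing; filtering, aggregation, cleansing; feature selection; visualization. Contextual Data feeds into the process, which runs in iterations.]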



Data Sources
      ‣ File: XML, JSON, CSV, TSV
      ‣ Database
         mysql -u root -p mydatabase < dump.sql
      ‣ API
         curl 'http://freebase.com/api/service/search?query=al+gore&indent=1'
         ‣ Factual
         ‣ Freebase
         ‣ Infochimps
         ‣ OpenStreetMap
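
As a rough illustration of the file-based sources listed above, the following sketch reads CSV, TSV, and JSON with the Python standard library. The inline sample data and the field names (`src`, `dst`, `dport`) are made up for the example and stand in for real files.

```python
import csv
import io
import json

# Inline stand-ins for files on disk; the field names are hypothetical.
CSV_TEXT = "src,dst,dport\n10.1.222.31,10.1.222.202,9111\n"
JSON_TEXT = '[{"src": "213.175.90.24", "dst": "192.168.0.15", "dport": 56772}]'


def load_delimited(text, delimiter=","):
    """Return one dict per row, keyed by the header line (CSV or TSV)."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def load_json(text):
    """Parse a JSON document into Python lists and dicts."""
    return json.loads(text)


csv_rows = load_delimited(CSV_TEXT)
json_rows = load_json(JSON_TEXT)
print(csv_rows[0]["dport"], json_rows[0]["dport"])
```

Note that CSV values arrive as strings while JSON keeps numeric types, which matters later when the fields are typed and normalized.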




Explore Data
      ‣ What is the data about?
      ‣ What are the data features/columns?
      ‣ Is there a common structure in the data?
      ‣ What are the data types?

                Nov 7 09:14:46 fwbox kernel: DROPPED IN=eth0 OUT=
                MAC=00:0c:29:e3:45:bd:00:0c:29:b5:5c:ee:08:00 SRC=10.1.222.31
                DST=10.1.222.202 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=63849 DF
                PROTO=TCP SPT=58485 DPT=9111 WINDOW=5840 RES=0x00 SYN URGP=0

                May 25 20:24:20 ram-laptop kernel: BLOCK any in: IN=eth1 OUT=
                MAC=00:13:02:ac:d8:ea:00:09:5b:3d:df:00:08:00 SRC=213.175.90.24 DST=192.168.0.15
                LEN=576 TOS=0x00 PREC=0x00 TTL=115 ID=23513 PROTO=TCP SPT=9030 DPT=56772
                WINDOW=65535 RES=0x00 ACK URGP=0
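
One way to start answering these questions for the first log entry above is to pull out its KEY=value pairs and guess a type for each. This is only a sketch: the type categories and the `guess_type` helper are assumptions for illustration, not part of any tool mentioned in the talk.

```python
import re

LINE = (
    "Nov 7 09:14:46 fwbox kernel: DROPPED IN=eth0 OUT= "
    "MAC=00:0c:29:e3:45:bd:00:0c:29:b5:5c:ee:08:00 SRC=10.1.222.31 "
    "DST=10.1.222.202 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=63849 DF "
    "PROTO=TCP SPT=58485 DPT=9111 WINDOW=5840 RES=0x00 SYN URGP=0"
)


def key_value_fields(line):
    """Collect every KEY=value pair; an empty value (OUT=) becomes ''."""
    return dict(re.findall(r"(\w+)=(\S*)", line))


def guess_type(value):
    """Very rough type guess (hypothetical categories)."""
    if value == "":
        return "empty"
    if re.fullmatch(r"\d+", value):
        return "int"
    if re.fullmatch(r"0x[0-9a-fA-F]+", value):
        return "hex"
    if re.fullmatch(r"\d+\.\d+\.\d+\.\d+", value):
        return "ipv4"
    return "string"


fields = key_value_fields(LINE)
for name, value in fields.items():
    print(name, guess_type(value), value)
```

Flags without a value (DF, SYN) and the leading timestamp are not captured by this pattern, which is itself a useful finding about the structure of the data.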



Parsing and Normalization
     ‣ Parsing
        ‣ extraction of entities / features
        ‣ imposing structure
        ‣ often use regexes
     ‣ Normalize
        ‣ field normalization
        ‣ term normalization: block, deny, dropped
     ‣ Generate a common output format for vis-tools (e.g., CSV)

     Sample log entries in three different formats:

        Oct 13 20:00:43.874401 rule 193/0(match): block in on xl0:
        212.251.89.126.3859 >: S 1818630320:1818630320(0) win 65535 <mss
        1460,nop,nop,sackOK> (DF)

        Oct 13 20:00:43 fwbox local4:warn|warning fw07 %PIX-4-106023: Deny tcp
        src internet: 212.251.89.126/3859 dst 212.254.110.98/135 by
        access-group "internet_access_in"

        Oct 13 20:00:43 fwbox kernel: DROPPED IN=eth0 OUT=
        MAC=ff:ff:ff:ff:ff:ff:00:0f:cc:81:40:94:08:00 SRC=212.251.89.126
        DST=212.254.110.98 LEN=576 TOS=0x00 PREC=0x00 TTL=255 ID=8624
        PROTO=TCP SPT=3859 DPT=135 LEN=556
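
Term normalization can be as small as a lookup table. The sketch below maps the three terms from the slide onto one canonical value; choosing "block" as the canonical term, and the function name, are assumptions made for the example.

```python
# Canonical action per vendor-specific term. Using "block" as the
# canonical value is an arbitrary choice for this sketch.
ACTION_TERMS = {
    "block": "block",
    "deny": "block",
    "dropped": "block",
    "pass": "pass",
}


def normalize_action(term):
    """Map a raw action term (any case) to its canonical form."""
    return ACTION_TERMS.get(term.strip().lower(), "unknown")


print([normalize_action(t) for t in ("block", "Deny", "DROPPED", "pass")])
```

Returning "unknown" instead of raising keeps a long parsing run going and makes unmapped terms easy to find afterwards.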

Parser
Raw
                        Oct 13 20:00:38.018152 rule 57/0(match): pass in on xl1: 195.141.69.45.1030 > 62.2.32.250.53:    34388 [1au][|domain] (DF)
                        Oct 13 20:00:38.115862 rule 57/0(match): pass in on xl1: 195.141.69.45.1030 > 192.134.0.49.53:   49962 [1au][|domain] (DF)
                        Oct 13 20:00:38.157238 rule 57/0(match): pass in on xl1: 195.141.69.45.1030 > 194.25.2.133.53:   14434 [1au][|domain] (DF)

Regex / Parser
                        (.*) rule ([-\d]+/\d+)\((.*?)\): (pass|block) (in|out) on (\w+): (\d+\.\d+\.\d+\.\d+)\.?(\d*) [<>] (\d+\.\d+\.\d+\.\d+)\.?(\d*): (.*)

Normalized (CSV)
                        Oct 13 20:00:38.018152,57/0,match,pass,in,xl1,195.141.69.45,1030,62.2.32.250,53,34388 [1au][|domain] (DF)
                        Oct 13 20:00:38.115862,57/0,match,pass,in,xl1,195.141.69.45,1030,192.134.0.49,53,49962 [1au][|domain] (DF)
                        Oct 13 20:00:38.157238,57/0,match,pass,in,xl1,195.141.69.45,1030,194.25.2.133,53,14434 [1au][|domain] (DF)
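
The raw-to-CSV step shown on this slide could be sketched in Python roughly as follows. The pattern is a version of the slide's regex with the backslash escapes written out; the escaped parentheses around the match field and the `\s*` after the destination port are assumptions added so that the output lines up with the normalized rows shown on the slide.

```python
import csv
import io
import re

# Timestamp, rule, match type, action, direction, interface,
# source IP/port, destination IP/port, remainder of the line.
PATTERN = re.compile(
    r"(.*) rule ([-\d]+/\d+)\((.*?)\): (pass|block) (in|out) on (\w+): "
    r"(\d+\.\d+\.\d+\.\d+)\.?(\d*) [<>] "
    r"(\d+\.\d+\.\d+\.\d+)\.?(\d*):\s*(.*)"
)

RAW = (
    "Oct 13 20:00:38.018152 rule 57/0(match): pass in on xl1: "
    "195.141.69.45.1030 > 62.2.32.250.53:    34388 [1au][|domain] (DF)"
)


def parse_line(line):
    """Return the captured fields as a list, or None if the line does not match."""
    match = PATTERN.match(line)
    return list(match.groups()) if match else None


def to_csv(rows):
    """Serialize parsed rows as CSV text, one line per row."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


print(to_csv([parse_line(RAW)]), end="")
```

Lines that do not match return None rather than raising, so unparsed entries can be counted and inspected separately.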




UNIX Tools
     ‣ grep
        ‣ cat file | grep -v "foo"
     ‣ awk
        ‣ awk -F, '{printf("%s,%s\n",$2,$1);}'
        ‣ awk -F, -v OFS=, '{print $2,$1}'
     ‣ sed
        ‣ sed -e 's/fubar/foobar/g' filename
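
For readers who prefer a script over one-liners, the three commands above have rough Python counterparts. The function names are made up for this sketch, and the versions here work on literal strings only, whereas grep and sed interpret their arguments as regular expressions.

```python
def grep_v(lines, needle):
    """Like `grep -v needle` for a literal string: drop lines containing it."""
    return [line for line in lines if needle not in line]


def swap_columns(lines, sep=","):
    """Like the awk one-liners: emit field 2, then field 1.

    Assumes every line has at least two fields.
    """
    out = []
    for line in lines:
        fields = line.split(sep)
        out.append(sep.join([fields[1], fields[0]]))
    return out


def substitute(lines, old, new):
    """Like `sed -e 's/old/new/g'` for literal strings."""
    return [line.replace(old, new) for line in lines]


sample = ["a,b,c", "foo,bar,baz", "fubar,1,2"]
print(grep_v(sample, "foo"))
print(swap_columns(sample))
print(substitute(sample, "fubar", "foobar"))
```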




Regular Expression Resources
     ‣   http://regexlib.com
     ‣   http://www.regular-expressions.info
     ‣   http://gskinner.com/RegExr




Data Cleansing
     ‣ Filter




     ‣ Normalize (see earlier)



     ‣ Aggregation
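
Filtering and aggregation can be illustrated on a handful of already-normalized records. The records below are hypothetical (source IP, destination port, action) tuples assembled from values seen in the earlier log samples; the helper names are also made up.

```python
from collections import Counter

# Hypothetical normalized records: (source IP, destination port, action).
records = [
    ("212.251.89.126", "135", "block"),
    ("212.251.89.126", "135", "block"),
    ("195.141.69.45", "53", "pass"),
]


def filter_records(rows, action):
    """Keep only the rows with the given action."""
    return [row for row in rows if row[2] == action]


def aggregate(rows):
    """Count how often each (source, port) pair occurs."""
    return Counter((src, port) for src, port, _ in rows)


blocked = aggregate(filter_records(records, "block"))
print(blocked.most_common())
```

Aggregating before plotting keeps the number of visual elements manageable: repeated events become one element with a count.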



Load CSV into Database
    Sometimes you just load your data into a tool, and you can omit this step.

    # mysql -u <user> -p
    mysql> create database data;
    mysql> use data;
    mysql> create table set1 (id int, address varchar(20), ...);
    mysql> LOAD DATA LOCAL INFILE 'input_file' INTO TABLE set1
           FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
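
The same load step can be tried without a MySQL server. The sketch below swaps in Python's built-in `sqlite3` module with an in-memory database; it is a stand-in for the MySQL commands above, not an equivalent of `LOAD DATA`, and the two sample rows are made up.

```python
import csv
import io
import sqlite3

# Stand-in for the CSV input file; the rows are hypothetical.
CSV_TEXT = "1,10.1.222.31\n2,213.175.90.24\n"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE set1 (id INTEGER, address VARCHAR(20))")

# Each CSV row becomes one parameterized INSERT.
rows = csv.reader(io.StringIO(CSV_TEXT))
conn.executemany("INSERT INTO set1 (id, address) VALUES (?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM set1").fetchone()[0]
print(count)
```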



Contextual Data
     ‣ Either dump it into a DB or query it via API calls to augment the data
     ‣ IP -> Geo mapping
     ‣ Information about countries
     ‣ Port number -> service name
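
Augmenting a record with contextual data often amounts to a join against a lookup table. The sketch below does this for the port-to-service case with a tiny hand-written table; in practice the table would come from a database or an API, and the record layout and function name are assumptions.

```python
# Tiny excerpt of a port -> service table, written by hand for this sketch.
PORT_SERVICES = {53: "domain", 80: "http", 135: "epmap"}


def add_service(record, table=PORT_SERVICES):
    """Return a copy of the record with a 'service' field added."""
    enriched = dict(record)
    enriched["service"] = table.get(int(record["dport"]), "unknown")
    return enriched


event = {"src": "212.251.89.126", "dst": "212.254.110.98", "dport": "135"}
print(add_service(event))
```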


Feature Selection
     ‣ What are the fields you are interested in?
     ‣ Compute new fields
        ‣ start time, end time -> duration
        ‣ IP subnets [ 10.2.4.2 -> 10.0.0.0/8 or 192.168.1.2 -> 192.168.1.0/24 ]
        ‣ Entropy: H(X) = E[I(X)]
     ‣ Dimensionality reduction
        ‣ See Bryan’s talk!
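
The three computed fields above can be sketched with the Python standard library. The timestamp format is an assumption, and the entropy function uses the Shannon definition with base-2 logarithms, estimated from observed value frequencies.

```python
import ipaddress
import math
from collections import Counter
from datetime import datetime


def duration_seconds(start, end, fmt="%Y-%m-%d %H:%M:%S"):
    """start time, end time -> duration (timestamp format is assumed)."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


def subnet(ip, prefix_len):
    """Map an address to its enclosing network, e.g. 10.2.4.2 -> 10.0.0.0/8."""
    return str(ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False))


def entropy(values):
    """Shannon entropy in bits of the observed value distribution."""
    counts = Counter(values)
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


print(duration_seconds("2011-01-01 10:00:00", "2011-01-01 10:01:30"))
print(subnet("10.2.4.2", 8), subnet("192.168.1.2", 24))
print(entropy(["a", "b", "a", "b"]))
```

A field whose entropy is near zero carries almost the same value in every record and is usually a poor candidate for a visual dimension.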




Choose Your Poison




Ode to the Pie




A Good Visual
     ‣ Choose the right graph
     ‣ Simultaneous views
     ‣ Reduce non-data ink
     ‣ Interactivity




Visual Transformations
     ‣ keep iterating on visual transformations, change
        ‣ color
        ‣ shape
        ‣ features displayed
     ‣ add new fields?
     ‣ add more context?
     ‣ is the output expressive?
     ‣ capture output and prettify it for presentation
Data Visualization Tools and Libraries

Tools and Libraries
      ‣ http://datainsightsf.com/resources/
         ‣ Choose what’s appropriate!
      ‣ Data Analysis and Visualization LInuX
         ‣ davix.secviz.org
      ‣ GraphViz
         ‣ graphviz.org
      ‣ AfterGlow (CSV -> DOT)
         ‣ afterglow.sf.net
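
To make the CSV -> DOT idea concrete, here is a minimal Python sketch that turns two-column CSV (source, target) into a GraphViz digraph. It only illustrates the kind of conversion involved and is not how AfterGlow itself is implemented; the function name and sample rows are made up.

```python
import csv
import io


def csv_to_dot(text):
    """Turn 'source,target' rows into DOT, one edge per distinct pair.

    Node names are quoted as-is; names containing double quotes
    would need escaping, which this sketch does not do.
    """
    edges = sorted(
        {(row[0], row[1]) for row in csv.reader(io.StringIO(text)) if len(row) >= 2}
    )
    lines = ["digraph G {"]
    lines += [f'  "{src}" -> "{dst}";' for src, dst in edges]
    lines.append("}")
    return "\n".join(lines)


sample = "195.141.69.45,62.2.32.250\n195.141.69.45,192.134.0.49\n195.141.69.45,62.2.32.250\n"
print(csv_to_dot(sample))
```

The resulting text can be saved to a file and rendered with a GraphViz layout program such as `dot`.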


Libraries
     ‣ Reporting Libraries
        ‣ HighCharts
        ‣ Flot
        ‣ Google Chart API
        ‣ Open Flash Chart
        ‣ JQuery Sparklines
        ‣ Polymaps
     ‣ Visualization Libraries
        ‣ TheJIT
        ‣ gRaphaël
        ‣ Protovis
        ‣ ProcessingJS
        ‣ Flare
        ‣ D3

HighCharts



 ‣ Click-Through
 ‣ On load
    ‣ near real-time updates
 ‣ Zoom

 www.highcharts.com

Google Visualization API


     http://code.google.com/apis/visualization/interactive_charts.html

      ‣ JavaScript
      ‣ Based on DataTable objects
      ‣ Many graphs
      ‣ Playground
         ‣ http://code.google.com/apis/ajax/playground

ProtoVis
     ‣ JavaScript-based visualization library
     ‣ Charting
     ‣ Treemaps
     ‣ BoxPlots
     ‣ Parallel Coordinates
     ‣ etc.

     http://vis.stanford.edu/protovis/
TheJIT   http://thejit.org/

     ‣ JavaScript InfoVis Toolkit
     ‣ Interactive
     ‣ Link Graphs




Processing
     ‣ Visualization library
     ‣ Java based
     ‣ Interactive (event handling)
     ‣ Number of libraries to
        ‣ draw in OpenGL
        ‣ read XML files
     ‣ http://processing.org/
     ‣ Processing JS
        ‣ JavaScript
        ‣ HTML 5 Canvas
        ‣ WebGL
        ‣ Web IDE
        ‣ http://processingjs.org/

Visualization Tools
     ‣ Gephi
     ‣ R
     ‣ Matlab
     ‣ Mondrian
     ‣ PicViz
     ‣ Treemap 4.1
     ‣ Google Earth
Gephi   http://gephi.org

     ‣ reads: CSV, DOT, etc.
     ‣ graph analysis algorithms
     ‣ highly interactive




PicViz




                                                   http://www.wallinfire.net/picviz/

Treemap 4.1




                                                    http://www.cs.umd.edu/hcil/treemap/
Google Earth
 • KML data format for encoding data
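
As a small illustration of encoding data in KML, the sketch below builds a document with one point placemark using plain string formatting. The place name and coordinates are made up for the example. KML lists coordinates as longitude,latitude, which is the reverse of the usual spoken order.

```python
from xml.sax.saxutils import escape


def placemark(name, lat, lon):
    """One KML point; note the lon,lat order inside <coordinates>."""
    return (
        f"<Placemark><name>{escape(name)}</name>"
        f"<Point><coordinates>{lon},{lat}</coordinates></Point></Placemark>"
    )


def kml_document(placemarks):
    """Wrap placemarks in a minimal KML 2.2 document."""
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
        + "".join(placemarks)
        + "</Document></kml>"
    )


# Hypothetical data point for the example.
doc = kml_document([placemark("Source A & B", 37.7749, -122.4194)])
print(doc)
```

Saving the output with a `.kml` extension should be enough for Google Earth to open it.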




pixlcloud                       buy now



collect. visualize. understand.



                 @raffaelmarty
