Big Data Panel

Deepak Agarwal, LinkedIn
      JSM, 2012
    San Diego, USA
Disclaimer

• The opinions expressed here are mine and in no way
  represent the official position of LinkedIn
Example of user interaction




ts, user-id, <items shown at various slots>, <what was clicked?>, <what happened after the click>

user-id: covariates; item-id: covariates; user-id: social connections
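
The record layout above could be sketched in Python roughly as follows. This is only an illustration: the slide does not specify an encoding, so the tab-separated layout, the "|" separator between items, the field names, and the helpers parse_event and ctr_by_slot are all assumptions made for this sketch.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class InteractionEvent:
    ts: int                            # timestamp (epoch seconds assumed)
    user_id: str
    items_shown: List[str]             # item ids; list index = slot position
    clicked_item: Optional[str]        # None when nothing was clicked
    post_click_action: Optional[str]   # None when there was no follow-up


def parse_event(line: str) -> InteractionEvent:
    """Parse one hypothetical tab-separated log line into an event."""
    ts, user_id, items, clicked, after = line.rstrip("\n").split("\t")
    return InteractionEvent(
        ts=int(ts),
        user_id=user_id,
        items_shown=items.split("|") if items else [],
        clicked_item=clicked or None,
        post_click_action=after or None,
    )


def ctr_by_slot(events: Iterable[InteractionEvent]) -> Dict[int, float]:
    """Click-through rate per slot: clicks at a slot / impressions at that slot."""
    shown: Counter = Counter()
    clicked: Counter = Counter()
    for event in events:
        for slot, item in enumerate(event.items_shown):
            shown[slot] += 1
            if item == event.clicked_item:
                clicked[slot] += 1
    return {slot: clicked[slot] / shown[slot] for slot in shown}
```

User and item covariates, and the social connections of each user, would be joined onto these events by user-id and item-id before modeling.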
Statistical Challenges
• Exploratory Data Analysis (EDA), Visualization
  – Retrospective (on Terabytes)
  – More Real Time (every few minutes/hours)
• Statistical Modeling
  – Scale (computational challenge)
  – Dimensionality (few categorical variables with
    massive number of levels interacting)
  – Temporal Effects
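
To make the dimensionality point concrete: crossing two categorical variables such as user-id and item-id yields far more levels than can be given one parameter each. One common workaround, which is not something the slides propose, is feature hashing, where each interaction is mapped to one of a fixed number of buckets. The sketch below assumes that technique; the function name hashed_interaction and the bucket count are hypothetical choices.

```python
import hashlib


def hashed_interaction(user_id: str, item_id: str, num_buckets: int = 2 ** 20) -> int:
    """Map a (user, item) interaction to a bucket index in [0, num_buckets).

    Distinct pairs may collide in the same bucket; that is the price paid
    for a fixed-size parameter vector.
    """
    key = f"user={user_id}&item={item_id}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_buckets
```

A model would then keep one weight per bucket rather than one per (user, item) pair, so memory use is set by num_buckets instead of by the number of observed levels.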
Statistical Challenges continued
• Experiments
  – To test new methods, test hypotheses with
    randomized experiments
  – Adaptive experiments
• Forecasting
  – Planning, advertising
My 2 cents
•   Big Data problems are complex, messy, and inherently multi-disciplinary
•   Having a clear idea of the underlying scientific problem is important
•   Systems, Algorithms, Statistics, Machine Learning, Optimization,…
•   Statisticians could use the wonderful tools created by our friends in
    other disciplines and develop the statistical aspects
     – Learn Hadoop and Pig; they have become easy to use (like R)
• Emphasis on areas like sampling, DOE, scalable model fitting

• More collaborative programs between academia/industry,
  academia/government
     – E.g. training programs for students working with problem owners
