Data Science Highlights
Highlights and summary of long-running programmatic research on data science; practices, roles, tools, skills, organization models, workflow, outlook, etc. Profiles and persona definition for data scientist model. Landscape of org models for data science and drivers for capability planning. Secondary research materials.


    Data Science Highlights: Presentation Transcript

    • Data Science Highlights
    • Data Scientist, Square - San Francisco Bay Area. Job Description: Square is hiring a Data Scientist on our Risk team. The Risk team at Square is responsible for enabling growth while mitigating financial loss associated with transactions. We work closely with our Product and Growth teams to craft a fantastic experience for our buyers and sellers.
 
 Desired Skills & Experience: As a Data Scientist on our Risk team, you will use machine learning and data mining techniques to assess and mitigate the risk of every entity and event in our network. You will sift through a growing stream of payments, settlements, and customer activities to identify suspicious behavior with high precision and recall. You will explore and understand our customer base deeply, become an expert in Risk, and contribute to a world-class underwriting system that helps Square provide delightful service to both buyers and sellers.
 
 To accomplish this, you are comfortable writing production code in Java and conducting exploratory data analysis in R and Python. You can take statistical and engineering ideas from prototype to production. You excel in a small team setting and you apply expert knowledge in engineering and statistics.
 
 Responsibilities: 1. Investigate, prototype, and productionize features and machine learning models to identify good and bad behavior. 2. Design, build, and maintain robust production machine learning systems. 3. Create visualizations that enable rapid detection of suspicious activity in our user base. 4. Become a domain expert in Risk. 5. Participate in the engineering life-cycle. 6. Work closely with analysts and engineers.
 
 Requirements: 1. Ability to find a needle in the haystack. With data. 2. Extensive programming experience in Java and Python or R. 3. Knowledge of one or more of the following: classification techniques in machine learning, data mining, applied statistics, data visualization. 4. Concise verbal and written articulation of complex ideas.
 
 Even Better: 1. Contagious passion for Square's mission. 2. Data mining or machine learning competition experience.
 
 Company Description: Square is a revolutionary service that enables anyone to accept credit cards anywhere. Square offers an easy-to-use, free credit card reader that plugs into a phone or iPad. It's simple to sign up. There is no extra equipment, complicated contracts, monthly fees, or merchant account required.
 
 Co-founded by Jim McKelvey and Jack Dorsey in 2009, the company is headquartered in San Francisco.
    • Sense Maker Segment: Sense makers need to create and/or employ insights to accomplish their business goals and satisfy their responsibilities. These insights emerge from independent and collaborative discovery efforts that involve direct interaction with discovery applications, and participation in discovery environments. Sense maker roles: Insight Consumer, Analyst, Casual Analyst, Data Scientist, Analytics Manager, Problem Solver.
    • Data Scientist: Profile
    • Data Scientist / Senior Research Scientist: Data Scientists work with other members of the Data Science team, using emerging methods and tools to engage with 'Big Data' from a variety of external and internal sources. Data Scientists aim to generate actionable insights that transform the organization; enhance existing products, services, and operations; and identify, define, and prototype new data-driven products, services, and offerings. They have advanced analytical skills and/or a specialized educational background, and rely on open-source and custom-created tools to address the ad-hoc and open-horizon questions the Data Science team takes on. Data Scientists collaborate with Insight Consumers, evolving and publishing insights and prototypes of new offerings.
 
 Business Goals & Work Setting • Create new data-driven products, services, business opportunities • Transform the business with insights derived from Big Data • Create effective tools and infrastructure for the data science group and other analytical groups within the organization • Develop prototypes based on proprietary or open source tools • Prototype new ways to visualize and understand data relationships • May work within a business unit, providing analytical capability to that unit only, or in a centralized Data Science group
 
 Discovery Needs • Solves complex, critical problems & significant and unique issues • Has numerous and dynamic ill-formed questions with unpredictable needs for data, visualization, and discovery capabilities
 
 Discovery Tools • Open source tools and platforms for big data, ETL, visualization, analysis, statistics: Hadoop, Cassandra, Kafka, Voldemort • Open source algorithms and languages: R, Hive, Pig • Custom-developed analytical tools
 
 Engagement w/ Discovery Applications • Creates custom discovery applications to suit their own needs • Application lifecycle involvement: rolls their own from scratch, iterates, and then publishes to wider audiences / productizes • Original author of all discovery solution elements: data / data sets, information models, discovery applications and workspaces • Shares / publishes insights to decision-making groups & social forums in the business
 
 Collaboration • Works with Engineers and Software Architects to create prototypes and products • Collaborates with Data Scientists on ill-formed questions
 
 Skills & Expertise • Data management, analytics modeling and business analysis • Prototyping / software engineering • Discovery: advanced statistics, quantitative and qualitative analysis, machine learning, data mining, natural language processing, computational linguistics, broad knowledge of applied mathematics, statistical methods and algorithms
    • Profiles & Discovery Problem Spectrum (from ill-formed to well-formed problems): Data Scientist, Analyst (all), Casual Analyst, Problem Solver
    • The ‘Conway Model’
    • http://upload.wikimedia.org/wikipedia/commons/4/44/DataScienceDisciplines.png
    • http://nirvacana.com/thoughts/wp-content/uploads/2013/07/RoadToDataScientist1.png
    • What sort of animal? They seem different from analysts in: • problem set • relationship to discovery tools • skills and professional profile • discovery / analytical methods • perspective • workflow and collaboration. Are they? How?
    • Areas of Investigation • Workflow • Environment • Organizational model • Pain points • Tools • Data landscape • Analytical practices • Project structure • Unmet needs
    • Interviews
    • Discussion Guide: Can you please walk me through a recent or current project? a. How was the project initiated? b. How defined was the business problem in the beginning? Did the problem change? c. Where/who did you obtain data sets from? How did you make the decision? d. Describe the data you used: What did the data sets look like? How big were they? Were they structured or unstructured? e. What tools or techniques did you use to do the analyses? Did they map to the specific steps you mentioned just now? f. How did you decide these were the tools/techniques to use? To what extent were these decisions made by yourself and to what extent were they standardized by your group/team? g. How did you present the results of your analyses? What tools did you use? What do you like and dislike about your current tool set? h. Which stage of this project was the most challenging? To what extent did the tools satisfy what you intended to do? What features were lacking? i. How much collaboration was there during each stage of the project? i. Background and role of collaborators ii. Collaboration modes iii. Types of information shared. Thinking about the projects you have worked on, is there a common approach you take to address these problems? How did you decide on this approach/tools?
    • Transcripts & Recordings
    • Synthesis
    • Findings
    • Business Analytics (future) = Data Science (now)
    • Persona: Dana, Data Scientist
 
 Job Title: Senior Data Scientist. Company: LinkedIn. Work Experience: 10 years. Education: Ph.D. Statistics, MS Bio-Informatics.
 
 "I'll do whatever it takes - wrangle, extract, manipulate, analyze, experiment, prototype - to use data to drive value & innovate."
 
 Creates data-driven insights, offerings, and resources to transform the organization.
 
 Background: Dana is a Senior Data Scientist who has worked at LinkedIn for 5 years. Dana's education includes a Ph.D. in Statistics and an MS in Bio-Informatics. Dana's previous work includes positions in academic research groups as a doctoral candidate and post-doc, as well as software engineering roles in the Internet & technology industries.
 
 Work Context: • Dana works with several other data scientists and her Analytics Manager on a centralized team • Dana and her colleagues aim to create data-driven insights, features, resources, and offerings that deliver strategic value to LinkedIn • Dana works with Analysts on other teams to define and create discovery tools, data sets, and methods for use by their groups at LinkedIn • Dana & team are visible & well established within LinkedIn, and have a voice in product strategy and operational context; they have a high degree of autonomy in defining data science projects • Dana works with Insight Consumers to suggest and determine potential new data-driven offerings to prototype and evaluate.
 
 Key Goals: • Leverage data to support the org mission • Enhance products & services with data-driven insights and features • Use data to identify new opportunities and prototype/drive new customer offerings • Create useful data sets/streams, measures, & resources (e.g., data models, algorithms, etc.)
 
 Typical Discovery Scenarios & Problems: • How can we leverage data to increase online engagement with LinkedIn? • How should we measure engagement & what factors drive it? • What aspects of a personal profile are most likely to encourage / discourage new connections between people? • How can we increase people's activity and contributions to topical discussion groups? • What factors drive the effectiveness of our marketing campaigns? • Why did one of our marketing campaigns work exceptionally well? • How can we leverage data to help recruiters identify and communicate effectively with qualified and potentially available candidates?
 
 Sample Workflow: Analyze & identify causal/predictive factors (Who are the best candidates to contact for a job based on recruiter needs and profile content?) > Prototype & experiment with a data-driven feature (How can we prototype/evaluate this without disrupting the site?) > Gather data & analyze results (Use descriptive, inferential, and predictive statistics to evaluate results) > Summarize & communicate (Review findings with colleagues; summarize, visualize, and communicate key findings to Insight Consumers / decision makers)
 
 Activities: • Mines, analyzes, & experiments with data to identify patterns, trends, outliers, causal factors, predictive models, & opportunities • Defines and explains newly devised measurements, predictive models, & insights • Compares effectiveness of operations at achieving company goals for engagement, growth, data quality • Produces & explores new data sets • Collaborates with other data scientists to capture new data streams • Prototypes new data-driven site features/offerings • Runs data-based experiments to test/evaluate models, hypotheses & prototypes • Communicates & explains analyses to colleagues & Insight Consumers
 
 Tools: • Open source data manipulation, mining & analysis tools including R, Pig, Hadoop, Python, etc. • Statistical packages such as SAS, SPSS, etc. • Custom analytical tools built using open source components and languages
 
 Pain Points: • Defining and capturing useful measures of online attention • Getting all the data analytic tools to work together properly • No current workflow support or tools for data wrangling, analysis, experimentation, and prototyping
 
 Wish List: • Effective tools to help experiment with and evaluate the value / utility of features and activities for users • Ability to rapidly prototype data-driven features without risk of online service disruptions
    • Empirical
    • Augmented
    • Accelerated
    • Cooperative
    • Nature of sense-making activity: Business Analytics is intuitive, manual, gradual, and individual; Data Science is empirical, augmented, accelerated, and cooperative.
    • The Essence • Empirical perspective • Business imperatives drive activities • Analytical approach • Recipe is always the same • Engineering always present • Data challenges are paramount • consume 60% - 80% of time and effort • Data volumes range from huge to moderate (PB down to MB) • Domain often drives analysis • Data scientists already have self-service • Some new problems, many the same • Use 'advanced' analytics, not conventional BA • Innovate by applying known analyses to new data • Current workflow fragmented across tools and data stores • Success can be a model, product, insight, infrastructure, or tool
    • State of the Discipline A small set of formally constituted Data Science teams at major Internet and technology companies (Facebook, Google, Microsoft, Yahoo, Twitter, LinkedIn, eBay, Amazon) lead the field in most identifiable respects: • maturity of practice - sophistication of methods, quality of infrastructure • history and tenure as a formal function / group • business integration and impact • internal and public visibility • pace of innovation in methods, tools, architecture • quality and rate of contributions to open source and other tools / infrastructure • role in the industry and public discourse on data science: visibility in community, publication of experiments and findings, etc.
    • Tooling & Infrastructure Leading shops have their own comprehensive and often home-built / heavily customized data science environments, tools, and infrastructure. This infrastructure is aligned to the particulars of their domain and business. Their data science environments are sometimes considerably more 'mature' than those of other shops. The large majority of existing data science teams and practices are 'followers' of these leaders, in the sense that while they have idiosyncratic problems and varying domains to address, they rely on innovation from the DS leaders to guide the evolution of their data science practices. Their environments reflect a mix of some purpose-built data science components and infrastructure extended / adapted from business analytic needs such as BI.
    • Tooling & Infrastructure Many organizations are establishing new data science capabilities. A minority of these create new data science teams / practices from scratch without building out other conventional analytical capabilities such as BI. They will need new environments to support data science activities, and may leapfrog older generations of analytic environment, following leaders by directly creating new 'stacks' oriented more specifically for data science. The majority of organizations are creating new data science capabilities by building on existing analytical groups and functions. In terms of environments and infrastructure, these organizations have existing analytical environments aligned to BI and other business analytic functions, not specifically adapted to data science needs. Cumulative investment in these environments can be very high. New teams will need new tools. Existing teams will need new tools to support new discovery activities. The Berkeley Data Analytics Stack is the most visible open source 'platform' at the moment; no interview participants mentioned it.
    • Organizational Model Data science capability is provisioned via standard org models (ranging across in-house, external, centralized, embedded, etc.). The ways data science teams and practice groups are managed, and their relationships to the orgs they are part of, seem conventional / familiar. We can summarize the landscape of organizational models for providing data science capability by plotting the size of the data science team / pool of resources vs. the 'distance' from the problem / need. The landscape reflects common patterns for specialized expertise. This could shift over time as discovery maturity increases overall, first within the analytics industry, then within the general business realm.
    • Discovery Problems Discovery efforts are set in motion by Insight Consumers, not Data Scientists. The success of efforts is gauged by Insight Consumers. Insights are used by the originating Insight Consumers, not other analysts, and rarely other Insight Consumers. Multiple hypotheses are often explored in parallel, supported by multiple data sets / interim data products. Useful reconstruction of analytical workflows requires a linear history of all steps / activities.
    • Discovery Problems Data science resources - individuals, projects, and teams - are always aligned to business areas or strategic goals: e.g. the Content Insights team at LinkedIn supports analytical goals related to LinkedIn's major push to enhance its media presence and role in media. At large group scales, this inverts - for example, within a company, communities of practice are aligned to a discipline, and will include members whose activities span the needs of all the business units. No analytical efforts begin completely open-ended, with no idea of the nature or import of the resulting insights. There is almost always a hypothesis, or more than one. (Even in more academic / research oriented settings, there is no basic research - all investigations are purposive and grounded in defined business intent.)
    • PROBLEM NATURE • Well-defined • Explicit form: Why, What, and How questions • Implicit form: which question • Hypotheses are driven by domain knowledge or work experience • Not very different from the problems business analysts address. Businesses address the same problems they have been working on, which are determined at the very beginning, before resources are allocated. Data scientists do not necessarily contribute to initiating new problems.
    • Data Science Insight Model (diagram: insight model, data product, product, analysts, outcomes)
    • Skills Portfolio Data scientists use three kinds of languages: analysis (R, Matlab), scripting (Python, Perl), and data processing (SQL, Pig). Analytical environments should allow integration of the languages / capabilities they offer. Every analyst has their preferred language / method, and defaults to using their own for analytical efforts. This holds true even within centralized analytical teams.
    • Skills
    • Discovery Maturity • Discovery is poorly understood and little recognized as a capability. It was rarely mentioned by any of the Data Science / Analytics professionals spoken with. When mentioned, it is seen as a small-scale activity and / or a desired outcome of particular projects, not something the organization needs to be able to do in an ongoing / comprehensive / large-scale fashion, such as understanding customers. • Data scientists understand their own challenges in terms of which stages / aspects of a data-centric workflow require the greatest time and effort, or present the most complexity or potential for introducing uncertainty / ambiguity into the effort. Broader framings are the need for or desire to work on data-driven products, or to transform and improve the business through offering data-centered insights. • Product-centric data scientists (aimed directly at making data-driven offerings) are a small minority of the active community. Many more are engineers with strong data skills, and many more are analysts trying to acquire data science skills / perspective.
    • Supporting Factors • Regardless of particulars, the core ingredients remain the same: analytical skills and perspective, domain knowledge, engineering / tooling skills and perspective. • In data science practices, analysis is always enabled by engineering - either localized to the data science team, or centrally provided via IT. • In BI practices, analysis is always enabled by IT and systems consultants / integrators (in house or external). • Leading DS groups rely on a number of hybrid approaches to support data cleansing and the evaluation of models, insights, and results - e.g. crowd-sourced prep of data and checking of results for prototypes and experiments. • Data scientists rarely productionize code, analytical workflows, or analytical tools. Engineers / IT convert 'prototype' artifacts created by data scientists into production code / tools.
    • Perspective Analytical: The analytical perspective is the center of definition for all analytical roles. Contrast with engineers, who "make stuff". Analytical roles figure things out for some purpose: whether a model to inform a product prototype or to provide insight. Empirical: The empirical perspective is distinct from the analytical perspective, and marks 'true' data scientists. It revolves around framing and testing hypotheses formally and informally, often requires validation and interrogation of experimental methods and results by others, and expects a significant degree of transparency at all stages of the analytical effort.
    • Cooperation and Collaboration • Discovery efforts are structured as individual efforts - insights come from individual analytical engagement with data sets. • Collaboration between analysts is asynchronous. • The diversity of analytical tools / languages in practice is a barrier to cooperation and collaboration. • There is little re-use of analytical insights by analysts to further other efforts. • When tools and/or problem domains are stable / known, analysts create individual and group assets for reuse - e.g. R script libraries, code snippets for SAS, templates for data set file formats and structures. • Intermediate work products created during analytical work (data sets / subsets, code, analytical scripts, algorithms, interim results, hypotheses) are perceived as often irrelevant or throwaway, if not outright wrong. Little investment is made to annotate / preserve intermediate work products for individual or group re-use, sharing, or review.
    • THE MANY SHADES OF COLLABORATION Independent: Have-it-all type data scientist (I know, I design & I implement). Linear: Complementary (Analysts know, data scientists design, engineers implement). Project-based: The missing piece (Data scientists lead or support engineers). Consultancy: From abstract to concrete (Some data scientists know & design, some other data scientists implement).
    • Data Landscape • The physical location of data - where it is stored / what environment - is a significant cost factor for almost all aspects of analytical work. • Distributed data (managed / located in multiple stores) increases costs for many individual steps in analytical workflows. • Distributed data costs are often a barrier to conducting insightful analysis using multiple techniques / steps; analysts default to basic / simple analysis to avoid high effort with low probability of success. • For analysts with low levels of db / data wrangling skill, even marginal distributed data costs are a preventative barrier to engaging with data. • Most analysts reported having to migrate all of the data sets into the same data processing framework to begin analysis. [If all the data were in one place...]
    • DATA NATURE • Messy: various forms (web logs, web pages, genome data, sales revenues...) • Scattered: data scientists have to search in the wild (outside of enterprise databases) • Starts "big", ends "lean": meaningful data units are small in size • Standardization is key to all data science work: why engineers become data scientists. Data scientists are "data foragers" and "data format equalizers". They have the ability to manipulate large data sets and gradually narrow the data sets down to the exact units needed for analysis.
    • Algorithms and Analytical Tools • Well-known algorithms and methods are used to plan and structure experiments, discover insights, drive the creation of new models, and evaluate the effectiveness of new models & products. • The algorithm and method are often determined by the domain, such as TF-IDF for IR or Smith-Waterman for bioinformatics.
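To make the domain-driven choice concrete, here is a minimal TF-IDF sketch in pure Python. The toy corpus and tokenization are invented for illustration; real IR work would rely on a library implementation.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a small corpus of tokenized documents."""
    n = len(docs)
    # document frequency: number of docs containing each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

docs = [["risk", "fraud", "payment"],
        ["payment", "growth", "seller"],
        ["fraud", "risk", "model"]]
w = tf_idf(docs)
# "growth" appears in only one document, so in doc 1 it outweighs
# "payment", which appears in two documents.
```

Terms common to the whole corpus are discounted toward zero, which is exactly why IR practitioners reach for this weighting by default.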
    • PROCESS NATURE • Wicked: solutions can often hardly be pre-defined • Iterative three-step cycle: data collection, data cleansing, & data analysis • Trial-and-error: hypothesis revision, hypothesis validation, & data recollection • Ad-hoc analysis and chance encounters. Data scientists provide new perspectives to address old problems. The path to the solution is usually exploratory, but the goal has always been clear and pre-defined.
    • Data Science Workflows http://strata.oreilly.com/2013/09/data-analysis-just-one-component-of-the-data-science-workflow.html
    • Data Science Workflows
    • Data Science Workflow • Frame problem / goal of effort • Identify and extract data to be used in the effort from the whole corpus / totality of available data • Exploratory identification and selection of working data for use in experiments • Define experiment(s): hypothesis / null hypothesis, methods, success criteria • Derive insight(s) • Wrangle, process, visualize, interpret • Codify / create new model reflecting insights & outcomes from experiments • Validate new model(s) • Provision training data • Train new model • Validate the outcome of the trained model • Hand off for implementation on production systems / as production code
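The stages above can be sketched as composable steps. Everything here is a hypothetical placeholder (the stage functions, the toy transaction records, and the flagging rule), meant only to show the extract / derive / codify / validate shape of the workflow, not any participant's actual pipeline.

```python
def extract(corpus, predicate):
    """Identify and extract working data from the full corpus."""
    return [row for row in corpus if predicate(row)]

def derive_insight(working):
    """Wrangle/process: here, a simple summary statistic (the mean)."""
    values = [row["amount"] for row in working]
    return sum(values) / len(values)

def codify_model(threshold):
    """Codify the insight as a model: flag amounts above the threshold."""
    return lambda row: row["amount"] > threshold

def validate(model, holdout, expected_flags):
    """Validate the model against held-out labeled data."""
    return [model(r) for r in holdout] == expected_flags

# hypothetical labeled transactions
corpus = [{"amount": a, "suspect": a > 50} for a in (10, 20, 90, 100, 30)]
working = extract(corpus, lambda r: r["amount"] > 0)
mean = derive_insight(working)
model = codify_model(mean)
ok = validate(model, corpus, [r["suspect"] for r in corpus])
```

In practice each stage would be far richer (experiment definitions, training data provisioning), but the hand-off contract between stages is the point of the sketch.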
    • Analysis Workflow & Activities • Empirical analysis of subsets of data • Understand topology of data, boundaries (sets / subsets, complete corpus, totality of data) • Outlier identification and profiling • How significant are outliers to overall topology • Comparative exclusion and profiling of resulting data subsets to understand their role, discover principal components • Find and analyze patterns, areas of interestingness / deserving attention • Find and analyze central actors / factors (in existing model that produced source data, in topology of working data, in patterns, etc.) • ID and understand their impact on local and global data topology and primary metrics if in several ways / more than one axis / at the same time • Discover and analyze relationships amongst central actors • Understand cycles, trends, changes (dynamic characteristics) for core actors, topology, patterns and structure • Understand causal factors • Codify / create new model reflecting insights & outcomes from experiments
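The outlier identification step in the list above can be illustrated with a first-pass z-score filter; the data and the 2-sigma cutoff are assumptions for the sketch, and real profiling would go on to compare topology with and without the flagged points.

```python
import statistics

def find_outliers(values, k=2.0):
    """Flag points more than k sample standard deviations from the mean -
    a first-pass profiling step before deeper comparative analysis."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * sd]

data = [12, 14, 13, 15, 11, 14, 13, 120]
outliers = find_outliers(data)  # the 120 dominates the topology
```

Comparative exclusion then amounts to re-running the same summaries on `data` minus `outliers` and asking how much the picture changes.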
    • Dynamic working data sets & subsets • iterative • experimental frame
    • Key Workflows Insight Consumer <> Data Scientist: originate, define, address discovery effort. Data Scientist > Data Engineer: create & evolve apps to address new & in-progress efforts. Analyst <> Analyst: define & address in-progress discovery efforts. Data Scientist > internal networks: create & curate archive & community.
    • Needs What are the most common and useful statistical techniques you use during discovery and analysis efforts? What statistical capabilities or functions would be very useful if provided within discovery applications, and where would they be useful? "(1) The most commonly used statistical techniques to date (in our strategic planning work) are: dimensionality reduction (partition clustering, multiple correspondence analysis), factor analysis, partition clustering (k-means, k-medoids, fuzzy clustering), cluster validation techniques (silhouette, Dunn's index, connectivity), multivariate outlier detection, linear regression, and logistic regression. (2) Techniques that would assist with identifying outliers or invalid data. Much of this work seems to be done by hand. I believe that we are also getting to the point where we could start using linear regression and splines (for showing trends)."
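The partition clustering this respondent mentions can be sketched with a minimal 1-D Lloyd's algorithm (k-means). The points and initial centers are invented; real work would use R or a statistics package, plus cluster validation such as silhouette or Dunn's index as the quote describes.

```python
def kmeans_1d(points, centers, iters=20):
    """A minimal 1-D k-means sketch (Lloyd's algorithm)."""
    for _ in range(iters):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # update step: each center moves to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.2, 0.8, 9.8, 10.0, 10.2]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
# converges to centers near 1.0 and 10.0
```

The same assign/update loop generalizes to higher dimensions by swapping the absolute difference for a Euclidean distance.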
    • Needs For example, would system-generated descriptive statistical visualizations be useful for whole data sets - or for smaller user-selected groups of attributes? Would it be useful for the application to analyze and suggest possible distribution models it sees in the data, for the values of individual attributes and/or for larger sets of data? "With regards to your last question on visualization, we have put in significant effort to use visualization in our Endeca installation. We have built visualizations such as tree maps, flow diagrams, sunburst diagrams, scatter plots showing clusters, and hierarchical edge bundling diagrams to explore our data sets. Our data tends to be qualitative rather than quantitative, so this drives much of our visualizations. So yes, interactive descriptive statistical visualization would be helpful - on the complete data set and individual attributes."
    • Needs 1. What are the most common statistical techniques you use at work - descriptive, inferential, or otherwise? What are the most valuable? 2. What are the most common visualizations you use to present findings or share insights? What are the most valuable? "(1) We do a lot of chi-square tests, permutation tests, false discovery rate correction, Bonferroni correction, 2x2 Fisher exact tests, logistic regression. I also use SVM, Artificial Neural Networks (ANN), Naive Bayes Classifiers (NBC), and part-of-speech taggers. (2) ROC curves, and tables with p-values, odds ratios, or hazard ratios (http://en.wikipedia.org/wiki/Hazard_ratio), e.g. a table of items vs. p-values: XYZ1 0.001, XYZ2 ..., etc."
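The chi-square test this respondent leads with reduces, for a 2x2 contingency table, to a few lines of arithmetic. The counts below are hypothetical, and in practice one would use a statistics package to get the p-value (and the Fisher exact test for small counts, as the quote also mentions); this sketch only computes the statistic itself, without a continuity correction.

```python
def chi_square_2x2(table):
    """Chi-square statistic for a 2x2 contingency table
    (no continuity correction): sum of (O - E)^2 / E."""
    (a, b), (c, d) = table
    n = a + b + c + d
    # expected counts from the row and column marginals
    expected = [[(a + b) * (a + c) / n, (a + b) * (b + d) / n],
                [(c + d) * (a + c) / n, (c + d) * (b + d) / n]]
    observed = [[a, b], [c, d]]
    return sum((observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
               for i in range(2) for j in range(2))

# e.g. flagged vs. clean transactions across two segments (made-up counts)
stat = chi_square_2x2([[30, 10], [20, 40]])
```

The statistic is then compared against the chi-square distribution with one degree of freedom to obtain the p-value the respondent's tables report.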
    • Needs 1. What are the most common statistical techniques you use at work - descriptive, inferential, or otherwise? What are the most valuable? 2. What are the most common visualizations you use to present findings or share insights? What are the most valuable? "Logistic Regression, Decision Trees, Markov Models, Area Under Curve"
    • Sense Makers: Information Management Ability (chart). Axes: data skills level (low / none to high) vs. composition capability (low / use to high / make). Casual Analyst and Problem Solver use models; Analyst and Analytical Manager customize models; Data Scientist creates new models and complex models.
    • Materials • http://www.datasciencecentral.com/ • Ben Lorica’s blog: http://strata.oreilly.com/ben • https://blog.twitter.com/tags/twitter-data • http://www.slideshare.net/s_shah/the-big-data-ecosystem-at- linkedin-23512853
    • Algorithms (ex: computational complexity, CS theory) Back-End Programming (ex: JAVA/Rails/Objective C) Bayesian/Monte-Carlo Statistics (ex: MCMC, BUGS) Big and Distributed Data (ex: Hadoop, Map/Reduce) Business (ex: management, business development, budgeting) Classical Statistics (ex: general linear model, ANOVA) Data Manipulation (ex: regexes, R, SAS, web scraping) Front-End Programming (ex: JavaScript, HTML, CSS) Graphical Models (ex: social networks, Bayes networks) Machine Learning (ex: decision trees, neural nets, SVM, clustering) Math (ex: linear algebra, real analysis, calculus) Optimization (ex: linear, integer, convex, global) Product Development (ex: design, project management) Science (ex: experimental design, technical writing/publishing) Simulation (ex: discrete, agent-based, continuous) Spatial Statistics (ex: geographic covariates, GIS) Structured Data (ex: SQL, JSON, XML) Surveys and Marketing (ex: multinomial modeling) Systems Administration (ex: *nix, DBA, cloud tech.) Temporal Statistics (ex: forecasting, time-series analysis) Unstructured Data (ex: noSQL, text mining) Visualization (ex: statistical graphics, mapping, web-based dataviz)
    • Skills
    • Skills Figure 3-3. There were interesting partial correlations among each respondent's primary Skills Group (rows) and primary Self-ID Group (columns). The mosaic plot illustrates the proportions of respondents who fell into each combination of groups. For example, there were few Data Researchers whose top Skill Group was Programming.