Experiences with big data by Srinivasan Seshadri
Talk by Srinivasan Seshadri, founder of Zettata at The Hive Big Data Think Tank Meetup - Healthcare 2.0 hosted at the EMC India

Presentation Transcript

  • EXPERIENCES WITH BIG DATA SRINIVASAN SESHADRI, FOUNDER ZETTATA
  • WORLD BEFORE BIG DATA
    "It is a capital mistake to theorize before one has data." (Sherlock Holmes)
  • HOWEVER, DO NOT WANT TO BE HERE
  • AXIOMS
    • Measure, measure, measure
    • Garbage in, garbage out
    • Correlation is not causation
    • More data beats cleverer algorithms
    • Algorithms that do better with more data are more interesting
    • Independent sources of data add new signals
    • Feature engineering is the key to being a good data scientist
    • How do machines and humans interplay in big data?
    • Learn many models: ensembles
    • Outliers are always interesting
  • MEASURE, MEASURE, MEASURE
    • Have a hypothesis
    • Create a metric to determine whether the hypothesis is correct
    • Build a solution that can be measured
    • Iterate
    "If you cannot measure it, you cannot improve it." (Lord Kelvin)
  • GARBAGE IN GARBAGE OUT
  • WHAT DO YOU WANT THE ANSWER TO BE?
  • CORRELATIONS
  • CORRELATION IS NOT CAUSATION
    • Correlation in data need not imply correlation in real life
    • Random correlations can be found in large amounts of data
    • Correlation does not imply causation
  • CORRELATION IS NOT CAUSATION
  • CORRELATION STRIKES AGAIN!!
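The point that random correlations turn up in large amounts of data can be illustrated with a minimal sketch. It is not taken from the talk: the function names, the column counts and the 20-row sample size are all assumptions chosen for illustration. It generates columns of independent random noise and reports the strongest correlation found between any two of them.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def strongest_chance_correlation(n_columns, n_rows, seed=0):
    """Largest |r| over all pairs of independent random columns.

    Every column is pure noise, so any correlation found is chance alone.
    """
    rng = random.Random(seed)
    columns = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_columns)]
    best = 0.0
    for i in range(n_columns):
        for j in range(i + 1, n_columns):
            best = max(best, abs(pearson(columns[i], columns[j])))
    return best

if __name__ == "__main__":
    # More columns means more pairs to compare, hence stronger chance "findings".
    for n_columns in (5, 50, 200):
        print(n_columns, round(strongest_chance_correlation(n_columns, n_rows=20), 2))
```

With only 20 rows per column, the number of pairs grows roughly with the square of the number of columns, so the best-looking correlation tends to strengthen as columns are added even though nothing real connects them.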
  • MORE DATA BEATS CLEVERER ALGORITHMS
    • Adding IMDb data for the Netflix Prize
    • Adding protein expression data or patient data to gene expression data
    • Bag-of-words approach for word sense disambiguation
  • WORD SENSE DISAMBIGUATION
    • "Bank" has two senses:
      • Sloping land alongside a river or a lake, typically with thick vegetation growing on it.
      • A financial institution that takes deposits from some customers and gives loans to others who require the money.
    • To disambiguate a typical sentence, look for co-occurrences of its words with the words in each definition. This is unsupervised learning: bootstrap a model.
    • Examples:
      • "The pilot landed the plane on the Hudson River amongst several boats, and an appreciative audience cheered from the banks of the river."
      • "He issued a check and took it to the bank so he could transfer money."
    • Then look for words that frequently co-occur with each sense ("boats" and "check" respectively) and build a larger bag of words with which to disambiguate.
  • WORD SENSE DISAMBIGUATION
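The co-occurrence idea for "bank" can be sketched roughly as follows. This is a simplified, assumption-laden illustration and not the speaker's implementation: the stopword list, the function names and the sense labels "river" and "finance" are hypothetical, and scoring is a plain count of shared words between the sentence and each sense's bag.

```python
import re

# Hypothetical minimal stopword list, just enough for the examples on the slide.
STOPWORDS = {"a", "an", "and", "the", "or", "of", "on", "it", "to", "from",
             "so", "he", "that", "who", "some", "has", "could"}

def content_words(text):
    """Lower-cased set of words in text, with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def disambiguate(sentence, sense_bags):
    """Return the sense whose bag shares the most words with the sentence.

    Returns None when no bag overlaps; ties go to the first sense listed.
    """
    context = content_words(sentence)
    scores = {sense: len(context & bag) for sense, bag in sense_bags.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def bootstrap(sentences, sense_bags, target_forms):
    """Grow each bag with the context words of sentences assigned to that sense."""
    bags = {sense: set(bag) for sense, bag in sense_bags.items()}
    for sentence in sentences:
        sense = disambiguate(sentence, bags)
        if sense is not None:
            bags[sense] |= content_words(sentence) - target_forms
    return bags

# Seed bags built from the two definitions on the slide.
SEED_BAGS = {
    "river": content_words("Sloping land alongside a river or a lake. "
                           "It typically has thick vegetation growing."),
    "finance": content_words("A financial institution that takes deposits from some "
                             "customers and gives loans to others who require the money."),
}
SENTENCES = [
    "The pilot landed the plane on the Hudson River amongst several boats and an "
    "appreciative audience cheered from the banks of the river.",
    "He issued a check and took it to the bank so he could transfer money.",
]
```

Calling `bootstrap(SENTENCES, SEED_BAGS, {"bank", "banks"})` adds words such as "boats" to the river bag and "check" to the finance bag, so a later sentence that shares no word with either definition can still be resolved through the enlarged bags.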
  • FEATURE ENGINEERING
    Cannot expect arbitrarily complex models to be learned by the computer
  • FEATURE ENGINEERING
    CITY 1 LAT.   CITY 1 LNG.   CITY 2 LAT.   CITY 2 LNG.   DRIVABLE?
    123.24        46.71         121.33        47.34         Yes
    123.24        56.91         121.33        55.23         Yes
    123.24        46.71         121.33        55.34         No
    123.24        46.71         130.99        47.34         No
  • FEATURE ENGINEERING
    DISTANCE (MI.)   DRIVABLE?
    14               Yes
    28               Yes
    705              No
    2432             No
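The step from four coordinate columns to a single distance column might look like the sketch below. The slides do not say how the distance was computed, so the haversine (great-circle) formula is an assumption, as are the function names and the 100-mile cut-off, which is simply one threshold that separates the "Yes" and "No" rows in the distance table.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius

def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def drivable(distance_miles, threshold_miles=100.0):
    """Hypothetical one-feature rule: short distances are drivable."""
    return distance_miles <= threshold_miles

if __name__ == "__main__":
    # Distances taken from the second table on the slide.
    for miles in (14, 28, 705, 2432):
        print(miles, "Yes" if drivable(miles) else "No")
```

Once the engineered feature exists, a single threshold is enough to separate the examples, whereas a model given the raw coordinates would have to discover the distance formula by itself.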
  • OF HUMANS AND MACHINES
    • Partnership is important
    • The "aha" moment and the strategy come from humans
    • Machines do the hard work of calculating fast, and they do not tire
    • Maybe some day machines will be able to do more than they are explicitly asked to do; today, explicit instructions are the norm
  • ENSEMBLES - OUTLIERS ARE NOT INTERESTING FOR CLASSIFIERS
    • Learn many models from random subsets of the training data
    • The effect of outliers is reduced in a majority of the models
    • Random Forests
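A toy sketch of why random subsets dampen outliers follows. It is not a Random Forest: to stay short it swaps in a 1-nearest-neighbour base model instead of decision trees, and the data, the function names, the subset size of 5 and the 201 models are all assumptions made for illustration.

```python
import random
from collections import Counter

def nearest_neighbour_label(train, x):
    """Label of the training point closest to x; train is a list of (value, label)."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def train_ensemble(data, n_models, subset_size, seed=0):
    """Each 'model' is a random subset of the data, drawn without replacement."""
    rng = random.Random(seed)
    return [rng.sample(data, subset_size) for _ in range(n_models)]

def ensemble_predict(models, x):
    """Majority vote over the nearest-neighbour prediction of every subset."""
    votes = Counter(nearest_neighbour_label(subset, x) for subset in models)
    return votes.most_common(1)[0][0]

# Toy 1-D data: class 0 at 0..9, class 1 at 10..19,
# plus one mislabeled outlier sitting inside the class-1 region.
DATA = [(float(x), 0) for x in range(10)] + [(float(x), 1) for x in range(10, 20)]
DATA.append((15.4, 0))

if __name__ == "__main__":
    models = train_ensemble(DATA, n_models=201, subset_size=5)
    print("single model:", nearest_neighbour_label(DATA, 15.3))
    print("ensemble:", ensemble_predict(models, 15.3))
```

A single model trained on all the data sees the outlier and is misled near it. Each small subset contains the outlier only some of the time (about 5 in 21 here), so the models that never saw it outvote the ones that did.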
  • OUTLIERS ARE ALWAYS INTERESTING FOR RANKING PROBLEMS
    • You have to be so good that they cannot ignore you
    • My personal thesis: average in everything is boring. Be outstanding in something.
    • Outliers along some dimension always carry interesting information whenever you combine multiple variables into one global rank
    • Search
    • Job interviews!
  • UNKNOWN UNKNOWNS - VERY INTERESTING TO A BUSINESS - OUTLIERS
  • BIG DATA AND HEALTHCARE
  • ARE YOU IN THE JOB MARKET?
  • www.zettata.com sesh@zetatta.com Thanks!