Big Data: It's More Than Volume, PayPal

In this presentation, Nachum Shacham talks about the uses and qualities of Big Data and how they are put to use at PayPal, where he works. He discusses the ultimate goal of extracting business value, and how to unlock the true value of your data through algorithms and sufficient data further down the long tail.

Published in: Technology

  1. 1. BIG DATA: IT’S MORE THAN VOLUME. Nachum Shacham, PayPal. Big Data Innovation Summit, April 2013
  2. 2. IT’S BIG-DATA TIME! • Volume → big platforms • Variety → multiple data types • Velocity → fast response • Value → a treasure of patterns
  3. 3. TECHNOLOGY HYPE CYCLE: BIG DATA. DM Tech Forum
  4. 4. MIXED SIGNALS FROM THE PUNDITS • Data Lake • “Needle in a hay stack” • “All hay no needles” • “Yet another fad” • “Noth’n new: we’ve been analyzing data for 30 years” • “Store’em and they’ll come” • “Don’t ever discard data” • “$524.752MM ROI in 3 years” • “Smart” … • “Hadoop is free” • “Just…”
  5. 5. USE YOUR OWN FILTER • Sift facts from MBS • Seek factual 1-liners • See through metaphors • Discount “Smart” (data, algorithms, systems) • Be skeptical
  6. 6. UNLOCK THE VALUE IN BIG DATA • Data Trumps Algorithms • Sufficient data further down the long tail • Wisdom of the crowd → effective recommendations • Combine signals from different media
  7. 7. BUSINESS VALUE IN BIG DATA • RISK ANALYSIS • IDENTIFY INFLUENCERS IN SOCIAL GRAPH • ONLINE ADS REVENUE OPTIMIZATION • FRAUD DETECTION AND PREVENTION
  8. 8. LET’S DIG INTO BIG DATA • Define KPIs • Explore • Model & Measure • Visualize signals • Test • Question test results • Rinse and Repeat
  9. 9. BIG-DATA ANALYTICS: FROM SEMI-STRUCTURED DATA TO BUSINESS SIGNALS MapAttempt TASK_TYPE="SETUP" TASKID="task_201212150932_52151_m_000051" TASK_ATTEMPT_ID="attempt_201212150932_52151_m_000051_0" TASK_STATUS="SUCCESS" Task TASKID="task_201212150932_52151_m_000051" TASK_TYPE="SETUP" TASK_STATUS="SUCCESS" FINISH_TIME="1355822133162" COUNTERS="{(FileSystemCounters)(FileSystemCounters)[(FILE_BYTES_WRITTEN)
  10. 10. PLATFORMS AS WORKBENCHES FOR BIG DATA AND THEIR TOOLS: Cloud RDBMS Data Warehouse Hadoop MPP
  11. 11. CLASSES OF ANALYTICS JOBS Big Data: Data organization for BI • A few large models • Many small models. DATA MANIPULATION • GRAPHICS • MODEL BUILDING • CROSS VALIDATION • PROBLEM MR FORMULATION
  12. 12. MATCH THE JOB TO THE PLATFORM
  13. 13. R: THE TOOL FOR ALL ANALYTICS STEPS • Data Sourcing • Data Preparation • Exploratory Data Analysis • Predictive Models • Visualization • Reporting
  14. 14. BI-LINGUAL HADOOP STREAMING: LARGE SCALE PARALLEL PREDICTIVE MODELING • Mapper.py (text processing): data files → process lines → set sorting key and value → output <key, value> • Shuffle sort • Reducer.R (model per segment): collect segment data marked by key → process segment data → output processed segment data
  15. 15. SEMI-STRUCTURED DATA → TABULAR DATA
      Meta VERSION="1" .
      Job JOBID="job_201212150932_52151" JOBNAME="DataFilter" USER="user1234" SUBMIT_TIME="1355822133394" JOBCONF="hdfs://tmp/hadoop-hadoop/mapred/staging/user1234/.staging/job_201212150932_52151/job.xml" VIEW_JOB=" " MODIFY_JOB=" " JOB_QUEUE="B" .
      Job JOBID="job_201212150932_52151" JOB_PRIORITY="NORMAL" .
      Job JOBID="job_201212150932_52151" LAUNCH_TIME="1355822223576" TOTAL_MAPS="50" TOTAL_REDUCES="0" JOB_STATUS="PREP" .
      Task TASKID="task_201212150932_52151_m_000051" TASK_TYPE="SETUP" START_TIME="1355822133148" SPLITS="" .
      MapAttempt TASK_TYPE="SETUP" TASKID="task_201212150932_52151_m_000051" TASK_ATTEMPT_ID="attempt_201212150932_52151_m_000051_0" START_TIME="1355822133545" TRACKER_NAME="tracker_dn0492.ebay.com:localhost.localdomain/127.0.0.1:33613" HTTP_PORT="50060" .
      MapAttempt TASK_TYPE="SETUP" TASKID="task_201212150932_52151_m_000051" TASK_ATTEMPT_ID="attempt_201212150932_52151_m_000051_0" TASK_STATUS="SUCCESS"
      Task TASKID="task_201212150932_52151_m_000051" TASK_TYPE="SETUP" TASK_STATUS="SUCCESS" FINISH_TIME="1355822133162" COUNTERS="{(FileSystemCounters)(FileSystemCounters)[(FILE_BYTES_WRITTEN)(FILE_BYTES_WRITTEN)(27089)]}{(org.apache.hadoop.mapred.Task$Counter)(Map-Reduce Framework)[(SPILLED_RECORDS)(Spilled Records)(0)]}" .
      Job JOBID="job_201212150932_52151" JOB_STATUS="RUNNING" .
      Task TASKID="task_201212150932_52151_m_000001" TASK_TYPE="MAP" START_TIME="1355822133163"
      attempt,201212171719,248176,m,000013,0,1355499674337,1355499903213,MAP,SUCCESS,default,rack3,lvsaishdc3dn0109,0109
      attempt,2012121771719,248176,m,000464,0,1355501042650,1355501253259,MAP,SUCCESS,default,rack5,lvsaishdc3dn0217,0217
      attempt,2012121771719,248176,m,000626,0,1355501212902,1355501366476,MAP,SUCCESS,default,rack17,lvsaishdc3dn0776,0776
      attempt,2012121771719,248176,m,001193,0,1355499673762,1355499887662,MAP,SUCCESS,default,rack8,lvsaishdc3dn0366,036
      attempt,2012121771719,248176,m,001355,0,1355499673545,1355499908182,MAP,SUCCESS,default,rack9,lvsaishdc3dn0386,0386
      attempt,2012121771719,248176,m,001517,0,1355501266524,1355501470527,MAP,SUCCESS,default,rack5,lvsaishdc3dn0236,0236
      attempt,2012121771719,248176,m,001850,0,1355501303142,1355501486691,MAP,SUCCESS,default,rack5,lvsaishdc3dn0235,0235
  16. 16. FROM TABULAR DATA TO BI
  17. 17. PARALLEL SEGMENTED MODELING R R R R R MAPPERS REDUCERS
  18. 18. MODELS BUILT ON LARGE DATASETS Meta VERSION="1" . Job JOBID="job_201112150932_52151" JOBNAME="DataFilter" USER="user1234" LAUNCH_TIME="1324801865576" • TIME INTERVAL DATA • CONCURRENCY • PERCENTILES • TIME SERIES • WORD COUNT REPRESENTATION • AVOID RAM LIMITATIONS • R STAT PROCESSING
  19. 19. LEVERAGING RDBMS POWER • Cloud R • teradataR • Scidb-R
  20. 20. TERADATAR FUNCTIONS (SAMPLE)
      Function Name   What it does
      td.zscore       Zscore Transformation
      td.t.paired     T Test Paired
      td.cor          Correlation Matrix
      td.f.oneway     One way F Test
      td.factanal     Factor Analysis
      td.freq         Frequency Analysis
      td.hist         Histograms
      td.kmeans       K-Means Clustering
      td.ks           Kolmogorov Smirnov Test
      td.mode         Mode Value of Column
      td.tapply       Apply a function over a database column
      td.summary      Like R summary()
      td.quantiles    Quantile Values
      td.rank         Rank
  21. 21. ANALYSIS OF A TABLE WITH > 1B ROWS
      > library(RJDBC)
      > library(teradataR)
      > tdConnect("TD_WH", uid = tdlogin, pwd = tdpwd, database = "myVDM")
      > system.time(myTbldf <- td.data.frame("myTbl"))
         user  system elapsed
        0.092   0.054 140.071
      > dim(myTbldf)
      [1] 1,131,670,269 9
      > system.time(cor <- td.cor(myTbldf[3:9]))
         user  system elapsed
        0.021   0.003   6.722
        C           D           E           F           G           H           I
      C 1.0000000   0.7096425   0.22154483  0.24186862  0.13354501  0.4954111   0.19577803
      D 0.7096425   1.0000000   0.24272691  0.27590234  0.13358632  0.4279517   0.14634683
      E 0.2215448   0.2427269   1.00000000  0.08940507  0.03734827  0.1631614   0.04401034
      F 0.2418686   0.2759023   0.08940507  1.00000000  0.07664496  0.1686094   0.04744032
      G 0.1335450   0.1335863   0.03734827  0.07664496  1.00000000  0.1247046   0.05837435
      H 0.4954111   0.4279517   0.16316144  0.16860940  0.12470460  1.0000000   0.35395733
      I 0.1957780   0.1463468   0.04401034  0.04744032  0.05837435  0.3539573   1.00000000
  22. 22. CONCLUSION • Big data is here. See through the hype • Analyze big data to extract value • Multiple technologies & analytics tools are out there • Match platform, tools and approach • Delegate massive processing to big clusters
  23. 23. QUESTIONS?
  24. 24. BIG DATA EMPOWERS ALGORITHMS Banko & Brill “Scaling to Very Very Large Corpora for Natural Language Disambiguation”
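
Slide 14 pairs a text-processing mapper with a model-per-segment reducer under Hadoop Streaming. The following is a minimal local sketch of that flow, not the presenter's code. It assumes the comma-separated attempt records of slide 15, and it assumes that fields 7 and 8 are start and finish times in milliseconds and that field 12 is the rack name (the deck does not label the columns). The reducer is written in Python rather than R to keep the sketch in one language, a per-segment mean stands in for the per-segment model, and `sorted()` stands in for the shuffle sort.

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'key<TAB>value' line per attempt record: rack -> duration in ms."""
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) < 12 or fields[0] != "attempt":
            continue  # skip anything that is not an attempt record
        duration_ms = int(fields[7]) - int(fields[6])  # assumed: finish - start
        yield f"{fields[11]}\t{duration_ms}"           # assumed: field 12 is the rack

def reducer(sorted_lines):
    """Consume key-sorted lines and emit one summary per key (segment)."""
    pairs = (line.split("\t") for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        values = [int(value) for _, value in group]
        # Stand-in for "model per segment": the mean duration of the segment.
        yield key, sum(values) / len(values)

records = [
    "attempt,2012121771719,248176,m,000464,0,1355501042650,1355501253259,MAP,SUCCESS,default,rack5,lvsaishdc3dn0217,0217",
    "attempt,2012121771719,248176,m,001355,0,1355499673545,1355499908182,MAP,SUCCESS,default,rack9,lvsaishdc3dn0386,0386",
    "attempt,2012121771719,248176,m,001517,0,1355501266524,1355501470527,MAP,SUCCESS,default,rack5,lvsaishdc3dn0236,0236",
]

shuffled = sorted(mapper(records))   # local stand-in for Hadoop's shuffle sort
result = dict(reducer(shuffled))
print(result)
```

On a cluster, the two functions would live in separate scripts that read stdin and write stdout. The reducer only relies on lines with the same key arriving next to each other, which is what the shuffle sort provides.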
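
Slides 9 and 15 show semi-structured job-history records being turned into tabular rows. One way to sketch that step is below. It assumes each record is a record type followed by `KEY="VALUE"` attributes, optionally terminated by ` .`, and that values contain no embedded double quotes. The helper names `parse_record` and `to_row` are hypothetical and not taken from the deck.

```python
import re

# One KEY="VALUE" attribute; the value may be empty.
ATTR = re.compile(r'(\w+)="([^"]*)"')

def parse_record(line):
    """Split one record into (record_type, {attribute: value})."""
    line = line.strip()
    if line.endswith(" ."):
        line = line[:-2]  # drop the record terminator
    record_type, _, rest = line.partition(" ")
    return record_type, dict(ATTR.findall(rest))

def to_row(line, columns):
    """Flatten one record into a fixed-width row; missing attributes become ''."""
    record_type, attrs = parse_record(line)
    return [record_type] + [attrs.get(column, "") for column in columns]

sample = (
    'Task TASKID="task_201212150932_52151_m_000051" TASK_TYPE="SETUP" '
    'TASK_STATUS="SUCCESS" FINISH_TIME="1355822133162" .'
)
row = to_row(sample, ["TASKID", "TASK_TYPE", "TASK_STATUS", "FINISH_TIME"])
print(",".join(row))
```

Choosing the column list up front keeps every row the same width even though different record types carry different attributes.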
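
Slide 18 lists time interval data and concurrency side by side. One reading of this is that start and finish timestamps are reduced to a count of simultaneously running tasks. The sketch below uses a standard sweep over start and end events, which is a swapped-in technique since the deck does not say how the concurrency was computed. The function name `peak_concurrency` is hypothetical.

```python
def peak_concurrency(intervals):
    """Return the largest number of intervals that are open at the same time."""
    events = []
    for start, end in intervals:
        events.append((start, 1))    # a task starts
        events.append((end, -1))     # a task finishes
    # Tuples sort by time first; at equal times -1 sorts before +1,
    # so a task ending at t does not overlap one starting at t.
    events.sort()
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak

# (start, finish) pairs in milliseconds, taken from the attempt rows on slide 15.
attempts = [
    (1355499674337, 1355499903213),
    (1355499673762, 1355499887662),
    (1355499673545, 1355499908182),
    (1355501042650, 1355501253259),
]
print(peak_concurrency(attempts))
```

Only the event list has to be sorted, so the same idea carries over to a reducer that receives events already ordered by timestamp.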
