CloudCon, Tuesday, October 2nd, 11am
The New Alchemy Turning Data into Gold


The New Alchemy
Turning Data into Gold By Brian Johnson
Engineering Director, eBay Search Science

Notes
  • I work at eBay, every second… BLANK SLIDE. Grocery store – 2 cans of soup. Point of No Return – people haven’t changed: motivation still the same, everyone loves free, don’t waste hard-earned resources, make intelligent decisions. Technology HAS changed, and it is still accelerating; behavior is easier to capture/analyze. Skip - Costco – Netflix – Blu-Ray – players I considered – players I DIDN’T consider – person asked, data collected - WHY? – Mobile phone – 3-5x speed of home network – blend online/offline – just commerce
  • You are in business to make money. How do you know if the changes you make, make money? You HAVE to test. You can’t manage what you don’t measure. Testing is crucial. Image: http://www.wallpapertimes.com/files/q/Yf/4j/qYf4jp9q86379020_800x600.jpg
  • Is my data BIG enough? Who cares. I don’t really care about defining how big “big” is. Big is whatever you need for detail-level (not aggregate) analysis. Image: http://www.skimountaineer.com/ROF/OcAnt/BigBen/BigBenHeardIsland.jpg
  • Beyond aggregate data: session-level detail. Item impression data – logging the items people DON’T click. We always knew what items people clicked on (view item page log). What about the items people did NOT click on? Need impression logging; they’re just as informative. Let’s bring this closer to home for you. Market basket data – buy this, buy that. Cowboy hats – detailed data.
  • Before we talk about the systems we have in place, let’s take a look at what happens in the industry and describe the buzzword of the year: Big Data. A big data warehouse is a data warehouse that is an order of magnitude bigger than the one you have, so data volume alone is not what Big Data is about. The key change is the form of the data and its processing requirements. Since 2003, more data is processed in 2 days than mankind produced in the last 40,000 years. The rise of the machines! Classical data warehousing stores data attributes in columns, nicely separated by the source application or the ETL process. This data is usually generated by direct user interactions and clearly defined transactions. The big boost in data volume comes from new data types like free-form text, audio, video, pictures, and graphs that do not easily fit into the structures of a database, or that pose quite a challenge for processing. The third key characteristic of Big Data is velocity, both in speed of processing and in speed of change. Initial use cases of Hadoop, like spam filtering, imply real-time processing combined with tremendous amounts of data. With this in mind, let’s look at what analytics systems we have in place today.
  • What do we have at eBay? DW for analysts comfortable with SQL & reporting. Hadoop for developers. You don’t have to do everything all at once; start and evolve.
  • Data is growing. Land it ONCE. Add Moore’s law graphic. Get backup data for data rate change. Jeff H slides? In a talk Google VP Marissa Mayer gave in August 2009, "The Physics of Data," she noted that there have been three big changes to Internet data in recent times: Speed (real-time data); Scale ("unprecedented processing power"); Sensors ("new kinds of data"). Mayer went on to say that there were 5 exabytes of data online in 2002, which had risen to 281 exabytes in 2009. That's a growth rate of 56 times over seven years. Partly, she said, this has been the result of people uploading more data. Mayer said that the average person uploaded 15 times more data in 2009 than they did in 2006. Links: http://blog.appro.com/the-big-data-challenge-for-data-intensive-computing-applications/ http://www.enterpriseirregulars.com/40616/the-enterprise-opportunity-of-big-data-closing-the-clue-gap/ http://www.ameinfo.com/231603.html http://www.f5.com/images/news-press-events/data-growth-monster.png http://www.veecom.co.uk/2010/the-difficulties-of-streaming-video-over-3g/ http://www.kurzweilai.net/the-law-of-accelerating-returns http://techcrunch.com/2010/03/16/big-data-freedom/
  • Let me summarize before turning to the search behavioral data I work with, to show you how you can use these principles to analyze your data.
  • Would you throw away money? Collect data: what seems big and expensive today will be cheap and valuable tomorrow. Don’t throw good data away.
  • Embed analytics in your business. Make it easy. Agile Analytics is the ability to support analytical requirements in a TIMELY manner, irrespective of their complexity. Enable business agility vs. development agility. Agile Analytics enables business to quickly and accurately make decisions. Image from http://jonmell.co.uk/enterprise-20-enables-business-agility/
  • Documents are not enough anymore. Need behavioral data – Yandex is beating Google in Russia. Why? They have users: refrigerators in Moscow vs. isolated small town.
  • 1 week > 6 months, 50 GB > 100 TB, related search: collaborative > collaborative + success + NLP + overlap/partition + …
  • Get started
  • How do we do this? Simple counting – that’s it, you “just” have to count. Image: http://www.csie.ntnu.edu.tw/~u91029/Matching.html
  • Detail matters. Context is important.
  • “beef labeling regulation & delegation of supervision law” - long word
  • Synonyms, for example… Wordle: http://www.wordle.net/show/wrdl/4067504/big – large big; ample, sizeable; astronomic, astronomical, galactic; bear-sized; blown-up; broad, spacious, wide; bulky; capacious; colossal, prodigious, stupendous; deep; double; enormous, tremendous; cosmic; elephantine, gargantuan, giant, jumbo; epic, heroic; extensive, extended; gigantic, mammoth; great; grand; huge, immense, vast, Brobdingnagian; hulking, humongous, banging, thumping, whopping, walloping; king-size; large-scale; life-size; macroscopic; macro; massive, monolithic, monumental; massive; monstrous; mountainous; outsize, outsized, oversize; super; titanic; voluminous; whacking
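
The notes on impression logging, testing, and "simple counting" can be illustrated with a small sketch. It assumes a hypothetical log format of (variant, item) pairs for both impressions and clicks; the function name and the sample records are invented for illustration and are not taken from the talk or from any eBay system.

```python
from collections import Counter

def click_through_rates(impressions, clicks):
    """Count impressions and clicks per variant and return clicks / impressions."""
    shown = Counter(variant for variant, _item in impressions)
    clicked = Counter(variant for variant, _item in clicks)
    # Items that were shown but never clicked still count in the denominator,
    # which is why impression logging is needed in the first place.
    return {variant: clicked[variant] / total for variant, total in shown.items()}

# Hypothetical log records: (variant, item_id)
impressions = [
    ("control", "a"), ("control", "b"), ("control", "c"), ("control", "d"),
    ("test", "a"), ("test", "b"), ("test", "c"), ("test", "d"),
]
clicks = [("control", "a"), ("test", "a"), ("test", "c")]

rates = click_through_rates(impressions, clicks)
print(rates)
```

With only clicks logged, both variants would look like lists of popular items; the rate per variant is only computable because the unclicked impressions were kept.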

    1. 1. CloudCon, Tuesday, October 2nd, 11am
    2. 2. Every Second – in over thousands of Categories
    3. 3. Value > Cost – $’s per year in incremental revenue (image: www.wallpapertimes.com)
    4. 4. DATA – Volume (incremental storage); Variety (structured, semi-structured, un-structured); Velocity (processing, change)
    5. 5. Analyze & Report | Discover & Explore
          Structured | Semi-Structured | Unstructured
          SQL | SQL++ | Java/C++/Pig/Hive
          Production Data Warehousing | Contextual-Complex Analytics | Structure the Unstructured
          Large Concurrent User-base | Deep, Seasonal, Consumable Data Sets | Detect Patterns
          Data Warehouse | Data Warehouse + Behavioral | Hadoop
          Enterprise-class System | Low End Enterprise-class System | Commodity Hardware System
          8+PB | 60+PB | 40+PB
    6. 6. Data Growing Faster
    7. 7. Data: questions later, structure later (<$0.04/GB, <$80/2TB). Single HDFS instances >50PB. Value > Cost
    8. 8. Designing for the Unknown. >85% of analytical workload is NEW & Unknown. The metrics you know are cheap. The metrics you don’t know are expensive – but high in potential ROI. Exploration & Testing are core pillars of an analytics-driven organization.
    9. 9. • Impact
    10. 10. Site  Key              Expansion      Top Query  Note
            US    diaries          diary
            US    baggies          baggy
            US    cranberries      cranberry
            US    jogging          jog
            US    fishing sticker  fish stickers
            UK    panels           panelling
            UK    protection       protecter
            UK    lining           lined
            UK    animation        animated
            UK    trucks           trucking
            UK    edging           edges
            UK    nets             netting
    11. 11. Site  Key              Expansion      Top Query                     Note
            US    diaries          diary          vampire diaries
            US    baggies          baggy          patagonia baggies             good for patagonia baggy, not good alone
            US    cranberries      cranberry      the cranberries
            US    jogging          jog            jogging stroller
            US    fishing sticker  fish stickers  fishing sticker               sports vs. kids rooms
            UK    panels           panelling      fence panels
            UK    protection       protecter      mcafee total protection 2012  screen protecter is top US query
            UK    lining           lined          pink lining changing bag
            UK    animation        animated       animation cel
            UK    trucks           trucking       corgi trucks
            UK    edging           edges          garden edging
            UK    nets             netting        purse nets
    12. 12. Value > Cost – $’s per year in incremental revenue (image: www.wallpapertimes.com)
    13. 13. Toys and Hobbies
            ATC > Artist trading card in ART
            ATC > Automatic Tool Change in Business and Industrial
    14. 14. German Compound Words
            • German compound words can be arbitrarily created and extremely long
                Adidastrainingsanzug (Adidas track suit)
                Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz (beef labeling regulation & delegation of supervision law)
            • Syntactically, words can be combined and split in many ways.
            • Some words shouldn’t be de-compounded.
                beiden (both) – bei (at) den (the)
            • Too many candidates for Granitpflastersteine (granite paving stones)
                Granit (granite) pflastersteine (cobblestones)
                Granit (granite) pflaster (paving/band-aid) steine (stones)
            • Binding characters
                Hochzeitsschuhe (grammatically correct, 593 hits on ebay.de)
                Hochzeitschuhe (129 hits on ebay.de)
    15. 15. Synonyms derived from top queries in item query clusters
            texas instruments ba ii plus = ti ba ii plus
            brighton handbag = brighton purse
            lenovo x200 = thinkpad x200
            king bedspread = king coverlet
            rockabilly dress = swing dress
            1963 ford falcon = 63 falcon
            jessica simpson hair extensions = jessica simpson hairdo
            Abbreviations/acronym derived from query transitions
            stanford ky = stanford kentucky
            dc sub = dc subwoofer
            snowboard helmet l = snowboard helmet large
            motorcycle cam = motorcycle camera
            diamond amp = diamond amplifier
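
The German compound word slide above describes several splitting problems: multiple candidate splits, binding characters, and words that should stay whole. A minimal sketch of candidate enumeration follows. It assumes a tiny hand-made lexicon and treats "s" as the only binding character; the function name, lexicon, and `keep_whole` list are hypothetical and are not the method used in the talk.

```python
def split_compound(word, lexicon, keep_whole=frozenset(), linkers=("", "s")):
    """Enumerate every split of `word` into lexicon entries.

    A binding character from `linkers` may sit between two parts.
    Words listed in `keep_whole` are returned unsplit.
    """
    word = word.lower()
    if word in keep_whole:
        return [[word]]
    splits = []
    for end in range(1, len(word) + 1):
        head, rest = word[:end], word[end:]
        if head not in lexicon:
            continue
        if not rest:
            splits.append([head])
            continue
        for linker in linkers:
            # Skip the linker only if something is left to split after it.
            if rest.startswith(linker) and len(rest) > len(linker):
                for tail in split_compound(rest[len(linker):], lexicon, keep_whole, linkers):
                    splits.append([head] + tail)
    return splits

# Toy lexicon (an assumption for illustration only).
LEXICON = {"granit", "pflaster", "steine", "pflastersteine",
           "hochzeit", "schuhe", "bei", "den"}

print(split_compound("Granitpflastersteine", LEXICON))
print(split_compound("Hochzeitsschuhe", LEXICON))
print(split_compound("beiden", LEXICON, keep_whole={"beiden"}))
```

Enumerating candidates is the easy part. Choosing among them, for instance between pflastersteine and pflaster + steine, would need extra evidence such as the query and hit counts the slide mentions.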
