eBay EDW Metadata Management and Applications

eBay Data Warehouse Metadata Management and Applications


Published in: Technology, Business

Transcript

  • 1. eBay EDW Metadata Management and Applications, Dec 2011. 熊家治, Architect, eBay Data Analytics Platform. [email_address]
  • 2. Agenda
    • eBay EDW History
    • eBay EDW overview
    • Metadata Management
    • Killer APP1: Data Flow Diagram
    • Killer APP2: Data Rationalization
    • Killer APP3: JobTrack
    • Other Applications
    • Q&A
  • 3. The Birth of eBay . . . . . . sold for $14.83 USD Started with a Broken Laser Pointer . . . AuctionWeb was born on the Labor Day weekend in September 1995 Pierre Omidyar $30 eBay Founder
  • 4. The Birth of eBay . . . FREE service running off a home server . . . $240 USD/month Pierre Omidyar
  • 5. The Birth of eBay . . . Requesting donations . . . Coins, Personal Checks, Bills, Money Orders, Coupons, Movie Tickets
  • 6. The Birth of eBay . . . Start Profitable . . .
  • 7. The Birth of eBay . . . Initial Business Model and Target Users . . . Build equitable electronic marketplace for Americans to buy and sell their stuff
  • 8. eBay Facts - 16 Years After . . .
    • 450+ million registered users
    • Over 2 billion photos
    • 220+ million active item listings for sale
    • 50,000 categories
    • 2 petabytes stored
    • 25 petabytes processed daily
    • 300+ features per quarter
    • 100,000 lines of code rolled out every 2 weeks
    • 48 billion SQL calls per day
    • 5.5 billion API calls per month
    • > 4.4 GB of source code
    • Global presence in 33 international markets
    • 10+ million new items added per day
    • $2,000+ USD trading value per second
  • 9. Analytical Data Platforms Singularity EDW Low End Enterprise-class System Discover & Explore Analyze & Report 20-50 concurrent users 500+ concurrent users Enterprise-class System >5 concurrent users Structure the Unstructured Detect Patterns Hadoop Developer System EDW/ODW (Primary& Secondary) “ Compare User Activity against last year” Trending and Forecast Analysis (large history) Operational Analytics Transactional Analytics High volume ad hoc queries Contextual-Complex Analytics Deep, Seasonal, Consumable Data Sets Production Data Warehousing Large Concurrent User-base Image Fingerprinting Image Classification Pattern Recognition Detect Counterfeits & SNADs
  • 10. eBay EDW
    • Born in 2000; now the largest and most powerful Teradata (TD) installation in the world
    • Migrated from Oracle to Teradata
    • Transactional data; ~7 years of history
      • 70,000 tables
      • 200+ subject areas: listings, bids, users, accounts, checkout, ...
    • 55,000 daily batch processes
    • Over 6,000 users relying on DW across all departments
    • Focus on Marketing, Search Optimization, Keyword Optimization, Trust & Safety, Customer Relationship Management, Finance, ...
    • Strategic Investment
    • Monetized through data sales
    • Partners: Teradata, AbInitio, UC4, Microstrategy, Sun, SAS
  • 11. Agenda
    • eBay EDW History
    • eBay EDW overview
    • Metadata Management
    • Killer APP1: Data Flow Diagram
    • Killer APP2: Data Rationalization
    • Killer APP3: JobTrack
    • Other Applications
    • Q&A
  • 12. EDW Architecture
  • 13. Closed loop, active Data Warehouse Site Databases Analytical Reporting Enterprise DW Raw data: daily, hourly feeds Knowledge: Integrated, aggregated, augmented www.ebay.com Trust & Safety Customer Support
    • Traffic Tracking
    • Finding
    • Rules engines
    • Real time creative
    • Advertising
    • Fraud prevention
    • Fake detection
    Wisdom: informed, fact based actions Marketing
  • 14. APD - Resource Distribution
    • Chennai, India: Cognizant Technology Services (onshore/offshore model).
    • Shanghai, CN: DW Core Team; APD Ops anchor point for China-based outsourcers (HP, DX). Core competencies: DW development, business system analysis, quality assurance, architecture, project management office, and production support.
    • Seattle, WA: DW Core Team and anchor point for India-based outsourcing. Core competencies: VLDB and highly efficient, scalable architecture (Next Gen).
    • San Jose, CA: BU dedicated teams (IMS, DMS, MRM, UBI), DW Core, and Arch & Ops. Core competencies: rapid development, VLDB, MPP, business analysis, DW development.
  • 15. Agenda
    • eBay EDW History
    • eBay EDW overview
    • Metadata Management
    • Killer APP1: Data Flow Diagram
    • Killer APP2: Data Rationalization
    • Killer APP3: JobTrack
    • Other Applications
    • Q&A
  • 16. EDW Metadata
    • Typically, the technical metadata that describes the data warehouse contains the following:
    • Data warehouse table structures
    • Data warehouse table attribution
    • Data warehouse source data (the system of record)
    • Mapping from the system of record to the data warehouse
    • Data model specification
    • Common routines for access of data
    • Definitions and/or descriptions of data
    • Relationships of one unit of data to another
    • More importantly, in a large-scale DW system like the eBay DW, we need to provide additional metadata to both technical users and end users:
    • Data flow
    • Table usage information
    • Subject dependency
    • System performance information
    We designed a tool to collect up-to-date ETL metadata automatically.
  • 17. Subject Areas and Tables
  • 18. Domains
  • 19. Data Model Management
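  • 20. ETL Metadata Collection Automation (diagram components: DBQL, TUM, Metadata Analysis Engine, Metadata Repository)
    • DBQL (Teradata query log)
      • Stores all user and batch queries
      • Contains all query-level information, such as run time, CPU cost, and the SQL statement
    • Analysis Engine
      • Analyzes the query statements
      • Organizes the information obtained from the analysis into metadata
    • Repository (metadata store)
      • Subject information
      • ETL information
      • Query information
      • Table information
    • TUM (table usage information)
      • Information on how each table is used
      • Source table / target table flag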
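As a rough illustration of what the analysis engine on this slide might do, the sketch below takes the SQL text of one logged query and emits table usage records with a source/target flag. The regular expressions, the `TableUsage` record, and `analyze_query` are hypothetical names invented for this example; the slides do not describe the real engine's parsing logic, and a production tool would need a proper SQL parser.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Hypothetical, deliberately simplified patterns. They ignore comments,
# quoted identifiers, and constructs such as EXTRACT(DAY FROM col).
TARGET_RE = re.compile(
    r"\b(?:INSERT\s+INTO|UPDATE|DELETE\s+FROM|MERGE\s+INTO)\s+([\w.]+)",
    re.IGNORECASE,
)
SOURCE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)


@dataclass(frozen=True)
class TableUsage:
    """One row of table usage metadata (assumed shape, not eBay's schema)."""
    query_id: int
    table_name: str
    role: str  # "source" or "target"


def analyze_query(query_id: int, sql: str) -> list[TableUsage]:
    """Flag each table referenced by the query as a source or a target."""
    targets = {name.lower() for name in TARGET_RE.findall(sql)}
    # A table that is written by the query is reported only as a target.
    sources = {name.lower() for name in SOURCE_RE.findall(sql)} - targets
    usages = [TableUsage(query_id, t, "target") for t in sorted(targets)]
    usages += [TableUsage(query_id, s, "source") for s in sorted(sources)]
    return usages
```

Feeding it one `INSERT ... SELECT` statement yields one target row and one source row per table read, which is the kind of record the repository would then store per query.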
  • 21. Agenda
    • eBay EDW History
    • eBay EDW overview
    • Metadata Management
    • Killer APP1: Data Flow Diagram
    • Killer APP2: Data Rationalization
    • Other Applications
    • Q&A
  • 22. Killer App 1: Data Flow Diagram (DFD)
    • The Data Flow Diagram (DFD) generation tool is an automated solution that generates DFDs for all EDW tables. A DFD shows how data flows through the EDW production environment.
  • 23. BEFORE/AFTER
    • BEFORE
      • Drawing a DFD is very time-consuming manual work
      • Very few manually drawn DFDs are available, and there is no centralized DFD repository
      • Most DFDs are created at the development stage and easily become outdated
      • DFD accuracy is not guaranteed
      • DFDs provide only very limited information
    • AFTER
      • DFDs are generated automatically for all EDW target tables
      • DFDs for all EDW target tables are available on Datahub
      • DFDs are refreshed weekly to ensure accuracy
        • Can be switched to a daily refresh if needed
      • DFDs are generated from the real process activity running in the production environment
      • DFDs carry enriched information
        • Table dependency
        • Jobs detail of each step
        • Runtime info such as CPU consumption and runtime
        • Critical Path for batch enhancement
  • 24. DFD Tool Architecture (diagram: DBQL and Table Usage Info -> Data Flow Analysis Engine -> DFD Metadata -> DFD Repository)
    • DBQL and Table Usage Info are Teradata dictionary tables
      • DBQL: contains the details of each query, such as runtime, CPU cost, query band, etc.
      • Table Usage Info: which table(s) are used by each query
    • The Data Flow Analysis Engine analyzes the raw data from DBQL and Table Usage Info to derive table dependency metadata at the batch script (job) level
      • Which table(s) are the output tables of each script (job)
      • Which table(s) are the input tables of each script (job)
    • DFD Metadata contains the results of the Analysis Engine, including the dependency metadata for each table. With this metadata, a DFD can be drawn for any table using the tool Graphviz.
      • Each script (job) is a node of the diagram
      • The dependencies between scripts (jobs) define the edges between nodes
    • The DFD Repository is the collection of DFDs, which we organize and display online
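To make the job-as-node idea on this slide concrete, here is a minimal sketch that turns per-job input/output table lists into Graphviz DOT text. The `Job` record and the `job_edges` and `to_dot` helpers are assumptions made for illustration; the slides only say that each script (job) is a node and that dependencies between jobs connect the nodes.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Job:
    """Dependency metadata for one batch script (hypothetical shape)."""
    name: str
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)


def job_edges(jobs: list[Job]) -> list[tuple[str, str, str]]:
    """Return (upstream job, downstream job, table) for every dependency."""
    # Assumes each table is populated by at most one job.
    producer = {table: job.name for job in jobs for table in job.outputs}
    edges = set()
    for job in jobs:
        for table in job.inputs:
            upstream = producer.get(table)
            if upstream and upstream != job.name:
                edges.add((upstream, job.name, table))
    return sorted(edges)


def to_dot(jobs: list[Job]) -> str:
    """Render the job graph as DOT text, one node per job."""
    lines = ["digraph dfd {", "  rankdir=LR;", "  node [shape=box];"]
    for job in jobs:
        lines.append(f'  "{job.name}";')
    for src, dst, table in job_edges(jobs):
        # The edge label names the table that links the two jobs.
        lines.append(f'  "{src}" -> "{dst}" [label="{table}"];')
    lines.append("}")
    return "\n".join(lines)
```

Writing the returned text to a file such as `dfd.dot` and running `dot -Tsvg dfd.dot -o dfd.svg` would produce a diagram; styling such as the gray target table or the blue critical path is left out of this sketch.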
  • 25. How to Read a DFD?
    • Step number (e.g. Step2): steps are ordered by job start time
    • Job start/end time (HH:MM:SS)
    • Script (job) name: the script that populates the table in that step
    • The output table of step 1 is also the input table of step 2
    • Rounded-corner rectangle: upstream tables from other subject areas
    • Blue line: the critical path of the process
    • Gray background: highlights the target table of the diagram
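The reading guide mentions two derived properties: step numbers ordered by job start time, and a critical path. The slides do not say how the critical path is computed, so the sketch below assumes one plausible definition: starting from the job that populates the target table, repeatedly follow the upstream job that finished last, since that is the job the downstream step had to wait for. `JobRun`, `number_steps`, and `critical_path` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class JobRun:
    """Runtime record for one job (assumed shape)."""
    name: str
    start: int  # seconds since midnight
    end: int    # seconds since midnight
    upstream: tuple[str, ...] = ()


def number_steps(runs: list[JobRun]) -> dict[str, int]:
    """Assign step numbers in order of job start time."""
    ordered = sorted(runs, key=lambda r: (r.start, r.name))
    return {run.name: i for i, run in enumerate(ordered, start=1)}


def critical_path(runs: list[JobRun], target_job: str) -> list[str]:
    """Walk back from the target job along the latest-finishing upstream job.

    Assumes the dependency graph is acyclic and every upstream name
    refers to a job present in `runs`.
    """
    by_name = {run.name: run for run in runs}
    current = by_name[target_job]
    path = [current.name]
    while current.upstream:
        current = max((by_name[n] for n in current.upstream), key=lambda r: r.end)
        path.append(current.name)
    path.reverse()
    return path
```

Under this definition, speeding up a job that is not on the returned path would not bring the target table's finish time forward, which is why the path is a natural focus for batch enhancement.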
  • 26. DFD’s Homepage
  • 27. Agenda
    • eBay EDW History
    • eBay EDW overview
    • Metadata Management
    • Killer APP1: Data Flow Diagram
    • Killer APP2: Data Rationalization
    • Killer APP3: JobTrack
    • Other Applications
    • Q&A
  • 28. Killer App 2: Data Rationalization
    • Data rationalization is the use of metadata to determine the optimum collection of data to provide the greatest business benefit to the end user.
    • Benefits
      • Reduce overall IT expenditures
      • Free up valuable resources
      • Extend the life of value-generating systems
      • Enhance the user experience
  • 29. Table Usage Metrics Platform
    • The Table Usage Metrics (TUM) platform is a series of tools to capture the table usage information for analysis purposes.
  • 30. Table Usage Metrics
    • Table Basic Info
      • Table Size
      • Table Skew Factor
      • Table PI (primary index)
    • Table Usage Metrics
      • CPU cost
      • Downstream Batch Hits
      • End User Hits
    • Table Refresh Info
      • Refresh Frequency
      • Average Complete Time
    Sample for 2011-07-20/2011-07-26, Primary vs. Secondary system:
    TableName  System     Table Size (GB)  Table Skew Factor  Daily CPU Cost  Batch Hits  End User Hits  Loading Strategy  Finish Time
    Table 1    Primary    3,990            5                  308,778         0           15             Daily Batch       15
    Table 1    Secondary  3,969            2                  323,822         0           26             Daily Batch       14
    Table 2    Primary    3,125            4                  76,274          24,837      28,338         Daily Batch       2
    Table 2    Secondary  3,071            2                  124,307         13,710      21,126         Daily Batch       3
  • 31. Data Rationalization Process
  • 32. Typical Approaches
    • Table Retirement
    • Frequency Reduction
    • Table Data Retention
    • Table Size Reduction by removing End of Life Data Elements
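The sketch below shows how table usage metrics could feed a first-pass rationalization suggestion. The `TableMetrics` record, the `suggest_action` rules, and the `low_hits` threshold are all assumptions made for illustration; the slides list the approaches but not the decision criteria. Retention and column-level size reduction would need usage data by date range and by column, so they are not covered here.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TableMetrics:
    """Usage metrics for one table over a reporting window (assumed shape)."""
    name: str
    size_gb: float
    daily_cpu_cost: float
    batch_hits: int
    end_user_hits: int
    refreshes_per_day: int


def suggest_action(m: TableMetrics, low_hits: int = 50) -> str:
    """Return a candidate action; thresholds are hypothetical."""
    if m.batch_hits + m.end_user_hits == 0:
        # Nobody reads the table: candidate for retirement.
        return "retire"
    if m.batch_hits == 0 and m.end_user_hits < low_hits and m.refreshes_per_day >= 1:
        # No downstream jobs and few end users: a slower refresh may suffice.
        return "reduce refresh frequency"
    return "keep"


# Figures taken from the Primary columns of the sample rows in the deck.
table_1 = TableMetrics("Table 1", 3990, 308778, 0, 15, 1)
table_2 = TableMetrics("Table 2", 3125, 76274, 24837, 28338, 1)
```

Any suggestion from rules like these would still need review with the table's owners before acting on it.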
  • 33. Agenda
    • eBay EDW History
    • eBay EDW overview
    • Metadata Management
    • Killer APP1: Data Flow Diagram
    • Killer APP2: Data Rationalization
    • Killer APP3: JobTrack
    • Other Applications
    • Q&A
  • 34. JobTrack Overview (diagram labels): ETL job runtime info from all ETL servers; UC4; table usage master; data flow; JobTrack repository; Teradata query log from TD1/TD2/TD3/TD5; table dependency; query pattern; query user behavior; user query/batch job enhancement; MDR; table usage info; ETL job status; JobTrack repository data source; applications ...
  • 35. JobTrack Features
    • Automation for any table and any ETL job
    • Real time + history + forecast
    • All information on one page
    • Not only data flow: all the information you need about the data
  • 36. Agenda
    • eBay EDW History
    • eBay EDW overview
    • Metadata Management
    • Killer APP1: Data Flow Diagram
    • Killer APP2: Data Rationalization
    • Killer APP3: JobTrack
    • Other Applications
    • Q&A
  • 37. Other Applications
    • ETL system performance monitor
    • Product Management based on Metadata
    • Data Quality
    • Query Pattern
    • DW User Behavior Analysis
    • ETL Problematic path Analysis
    • Others
  • 38. Q&A
