Introduction to DISQL, a distributed programming framework widely used in Baidu

Presentation Transcript

  • 1. Introduction to DISQL
    Chen Xiaoming (陈晓鸣)
    Senior Engineer, Baidu IBASE Dept. (百度基础平台部)
  • 2. What is DISQL?
  • 3. DISQL is a distributed programming framework widely used in Baidu
  • 4. Contents
    Problems
    Solution
    Examples
    Rationales
    Adoption
  • 5. Problems
  • 6. Problems
    statistical analysis of logs
    extraction of fields
    in order to generate reports
  • 7. Problems
    statistical analysis of features
    features of web pages, web sites, ads, user preferences, etc
    in order to provide data for data mining and machine learning
  • 8. Problems
    common operations
    selecting, filtering, grouping, sorting, joining, etc
  • 9. Solution
  • 10. A Platform
    named Log Statistical Platform, a.k.a. LSP
    web-based
    convenient for secondary development
    convenient for task/data/rights management
  • 11. A Programming Framework
    named DIstributed SQL, a.k.a. DISQL
    provide SQL-like operators which can be combined arbitrarily
    encapsulate distributed algorithms
    automatic code generation
  • 12. Application Programming Interfaces
    named Distributed Query, a.k.a. DQuery
    DSL-style APIs embedded in well-known programming languages
    PHP so far; C++, Python, … in the future
    uses the method chaining technique to provide a fluent interface
    data flow in the form of a DAG composed of chains of methods
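The transcript does not preserve any DQuery code, so the following is only a minimal sketch of the idea on this slide: chained method calls that build a data-flow DAG. It is written in Python, one of the host languages the slide lists as planned. The `Node` class and its `source`, `select`, `filter`, `join`, `plan` and `run` methods are hypothetical names chosen for illustration, not DQuery's actual API, and the graph is evaluated locally rather than on a cluster.

```python
class Node:
    """One operator in a data-flow graph; `parents` are the upstream nodes."""

    def __init__(self, op, parents=(), arg=None):
        self.op, self.parents, self.arg = op, tuple(parents), arg

    @staticmethod
    def source(name, rows):
        return Node("source:" + name, (), list(rows))

    # Each chained call returns a new node, so a chain reads left to right.
    def select(self, fn):
        return Node("select", (self,), fn)

    def filter(self, pred):
        return Node("filter", (self,), pred)

    def join(self, other, key):
        return Node("join", (self, other), key)

    def plan(self):
        """Operator names in topological order (parents before children)."""
        seen, order = set(), []

        def visit(node):
            if id(node) in seen:      # a shared node is listed only once
                return
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node.op)

        visit(self)
        return order

    def run(self):
        """Evaluate the graph locally (illustration only, no distribution)."""
        if self.op.startswith("source:"):
            return list(self.arg)
        inputs = [p.run() for p in self.parents]
        if self.op == "select":
            return [self.arg(row) for row in inputs[0]]
        if self.op == "filter":
            return [row for row in inputs[0] if self.arg(row)]
        if self.op == "join":
            left, right = inputs
            index = {}
            for row in right:
                index.setdefault(row[self.arg], []).append(row)
            return [{**l, **r} for l in left for r in index.get(l[self.arg], [])]
        raise ValueError("unknown operator: " + self.op)


# Two chains that start from the same source and meet again in a join form a DAG.
logs = Node.source("logs", [{"site": "a.com", "shows": 3}, {"site": "b.com", "shows": 0}])
shown = logs.filter(lambda r: r["shows"] > 0)
names = logs.select(lambda r: {"site": r["site"], "name": r["site"].upper()})
flow = shown.join(names, "site")
```

Because every method returns a node instead of executing immediately, a framework of this shape can inspect the whole graph (here via `plan()`) before deciding how to run it.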
  • 13. Three Edit Modes – Simple Mode
  • 14. Three Edit Modes – DQuery Mode
  • 15. Three Edit Modes – Complex Mode
  • 16. Hierarchy
  • 17. DISQL Architecture
    Edit Modes: Simple Mode, DQuery Mode, Complex Mode
    APIs: PHP, C++, Python
    Translators: Normalizer, Optimizer, Splitter, Planner, Coder
    Runtimes: Data-flow, Schema, Storage APIs, Computing APIs
  • 18. LSP Architecture
    data presentation & monitoring
    third party apps
    data access layer
    data management layer
    computing layer
    storage systems
    computing systems
  • 19. Examples
  • 20. Example 1 – word count
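The code shown on this slide is not preserved in the transcript. As a stand-in, here is a plain-Python sketch of word count written as the usual map, shuffle and reduce steps. It is not the DQuery version from the talk; the function names are illustrative, and everything runs in one process.

```python
from collections import defaultdict


def map_phase(lines):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle(pairs):
    """Group the emitted values by key, as a distributed shuffle would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each word."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines)))
```

Calling `word_count(["a b a", "b c"])` counts `a` twice, `b` twice and `c` once.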
  • 21. Example 2
    given a log of query and ad shows
    extract site field from url field
    filter sites with regex
    calculate the amount of query and ad shows per site
    output in JSON format
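The four steps above can be sketched in plain Python using only the standard library. The log schema is an assumption, since the slide does not give one: each record is taken to be one query, with a `url` field and an `ad_shows` count. The function name `site_report` and the output field names are likewise made up for illustration.

```python
import json
import re
from collections import defaultdict
from urllib.parse import urlsplit


def site_report(records, site_pattern):
    """Count queries and ad shows per site and return the result as JSON."""
    keep = re.compile(site_pattern)
    totals = defaultdict(lambda: {"queries": 0, "ad_shows": 0})
    for rec in records:
        # Step 1: extract the site (host name) from the url field.
        site = urlsplit(rec["url"]).hostname or ""
        # Step 2: keep only sites matching the regular expression.
        if not keep.search(site):
            continue
        # Step 3: accumulate per-site totals (assumed: one record = one query).
        totals[site]["queries"] += 1
        totals[site]["ad_shows"] += rec["ad_shows"]
    # Step 4: output in JSON format.
    return json.dumps(totals, sort_keys=True)
```

For example, `site_report(records, r"\.example\.com$")` would keep only hosts ending in `.example.com`.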
  • 22. Code in DQuery Mode
  • 23. Rationales
  • 24. Use Case Driven VS Completeness
    [Figure: "Our Solution" and four "Problem" labels]
  • 25. Internal DSL VS External DSL
    take advantage of:
    parsers, libraries and VMs of the host languages
    users and communities
    language features
    different from Pig, Hive, Sawzall, etc
  • 26. Open/Closed Principle
    “open for extension, closed for modification”
    open for single machine algorithms, closed for distributed algorithms
    also different from Pig, Hive, Sawzall, …
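One way to read "open for single machine algorithms, closed for distributed algorithms" is that users plug ordinary functions into a fixed distributed skeleton. The sketch below illustrates that reading only; the slides do not show how DISQL actually does it. `group_aggregate` and its parameters are hypothetical, and the partitions are simulated as in-memory lists.

```python
def group_aggregate(partitions, key_fn, init, step, merge):
    """Closed part: partial aggregation per partition, then a merge.

    Users never edit this function; they only pass in key_fn, step and merge.
    `init` should be an immutable value because it is shared between keys.
    """
    partials = []
    for part in partitions:  # on a cluster, each partition would run on its own worker
        acc = {}
        for rec in part:
            key = key_fn(rec)
            acc[key] = step(acc.get(key, init), rec)
        partials.append(acc)

    result = {}
    for acc in partials:     # combine the per-partition partial results
        for key, value in acc.items():
            result[key] = merge(result[key], value) if key in result else value
    return result


# Open part: a user-defined, single-machine aggregate (maximum latency per site).
partitions = [
    [{"site": "a", "ms": 5}, {"site": "b", "ms": 7}],
    [{"site": "a", "ms": 9}],
]
slowest = group_aggregate(
    partitions,
    key_fn=lambda r: r["site"],
    init=0,
    step=lambda acc, r: max(acc, r["ms"]),
    merge=max,
)
```

Adding a new aggregate means writing new `step` and `merge` functions; the partitioning and merging logic stays untouched.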
  • 27. Adoption
  • 28. Users
    ……
    ……
  • 29. Usage
    throughput/day: hundreds of TB
    tasks/day: thousands
    total tasks: > 1 million
  • 30. Q&A
    you are also welcome to contact me via:
    • Twitter: @acumon
    • Email: chenxiaoming@baidu.com
    • Gmail/Gtalk: acumoncxm@gmail.com
  • 31. The End
    THANK YOU!