
Sven Kreiss, Lead Data Scientist, Wildcard at MLconf ATL - 9/18/15


Deep ML Architecture at Wildcard: At Wildcard we develop technologies for a future native mobile web experience built on cards. Cards are a new UI paradigm for mobile content, for which we schematize unstructured web content. Part of the challenge is developing an understanding of online content through machine learning algorithms. The extracted information is used to create cards that are surfaced in the Wildcard iOS app and in other card ecosystems. I will describe the challenge and the way we structure content extraction as a deep architecture of classification and optimization algorithms, one that combines the traditionally factorized sub-problems of content extraction and thereby allows the various stages to inform each other. The talk includes an overview of the data we use, our features, and our training strategy with a partly human-powered labeling system. This ML system, called sic, runs in production, and I will show how we use only fast features, or a mix of fast and slow features, depending on the use case in the app.



  1. Deep ML-Inspired Architecture at Wildcard. Sven Kreiss, @svenkreiss
  2. I am a Data Scientist at Wildcard. We launched last month and were featured in the App Store as “Best New App”. We are looking to grow our data team.
  3. Wildcard • founded in 2013 • develops technologies for a future native mobile web experience through cards • Cards: a new UI paradigm for content on mobile, for which we schematize unstructured web content; surfaced in the Wildcard iOS app and in other card ecosystems.
  4. Wildcard: View as Card
  5. ML Challenge • extract online content through ML • the micro-service in this talk powers 54% of cards in Wildcard • url → {“title”: “…
  6. Dataset • scrape articles from a diverse set of sources • custom labeling tools based on Databench:
  7. Labeling Tools: Tree-Based and Visual • cross-matched labels between the tools • in-house labeling sessions before handing off to offshore workers (for usability) • labels are assigned to page elements
  8. Content Tree Labeling
  9. Visual Labeling
  10. Features • Text properties: length, capitalization, special characters, numbers, first 20 characters identical to the page’s meta title, … • BoW text: bag-of-words of the visible text • BoW meta: bag-of-words of CSS classes and other non-visible information inside HTML tags • HTML tag • Optional info from emulation: (x, y), (w, h), font-family, font-size, font-weight, …
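The text-property features above can be sketched as a small function. This is an illustrative sketch only: the feature names and the meta-title comparison are my assumptions, not Wildcard's actual implementation.

```python
# Hypothetical sketch of the text-property features from the slide above.
# Feature names and the meta_title comparison are illustrative.
import re

def text_features(text, meta_title=""):
    """Return a dict of simple text-property features for a page element."""
    return {
        "length": len(text),
        "capitalized_fraction": (
            sum(c.isupper() for c in text) / len(text) if text else 0.0
        ),
        "special_char_count": len(re.findall(r"[^\w\s]", text)),
        "digit_count": sum(c.isdigit() for c in text),
        # first 20 characters identical to the page's meta title?
        "matches_meta_title": text[:20] == meta_title[:20],
    }

print(text_features("Hello, World!", meta_title="Hello, World!"))
```

In a real pipeline these scalar features would be concatenated with the bag-of-words vectors before being handed to the classifier.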
  11. Pipeline • Parallelized document processing into features using Apache Spark; starts from a list of URLs. • Scrapes web pages. • Constructs the Content Tree. • Matches labels. • Filters for quality. • The same processing is needed for a single web page, but with low latency and small resource requirements: → pysparkling, a pure-Python implementation of Spark’s RDD interface
  12. pysparkling • interface-compatible with SparkContext and RDD, but no dependence on the JVM • pysparkling.fileio can access local files, S3, HTTP and HDFS with a load-dump interface • used in the Python micro-service endpoint that applies scikit-learn classifiers • used in labeling and evaluation tools and in local development • used in dataset preparation tools (train-test split, splitting URLs by domain, …)
  13. Pipeline II • single-machine Random Forest training • “256GB ought to be enough for anybody” (for machine learning) - Andreas Mueller • multithread support, fast • use provided structured data (e.g. meta tags) as much as possible
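Single-machine Random Forest training with multithreading might look like the following sketch. The data is synthetic and the label set is assumed from the later slides; the real features are the text/BoW/emulation features described earlier.

```python
# Sketch: single-machine, multithreaded Random Forest training.
# Synthetic data stands in for the real page-element features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 8)             # 200 page elements, 8 features each
y = rng.randint(0, 3, size=200)  # labels: 0=navigation, 1=title, 2=author

# n_jobs=-1 uses all available cores (the "multithread support" bullet)
clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
clf.fit(X, y)

proba = clf.predict_proba(X[:1])  # per-class probabilities for one element
print(proba.shape)  # (1, 3)
```

The per-class probabilities (rather than hard labels) matter later: the document-wide likelihood in the sampling approach is built from them.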
  14. Architecture
  15. ML Algorithms: Tough Luck with Structured Learning
  16. Algorithm: zeroth order • diagram: each page element (e.g. /html/body/div[2]/div/div/div/ul/li[5], /html/body/div[3]/h1, /html/body/div[3]/span) is independently assigned a label (“navigation”, “title”, “author”) by a scikit-learn RandomForest
  17. Algorithm: first order • diagram: page elements labeled “navigation”, “title”, “author”
  18. Algorithm: second order • diagram: page elements labeled “navigation”, “title”, “author”
  19. Requirements • text-density-based labeling is too rigid: we want to extend beyond news articles to other content types • clustering is too noisy: ads appear in between paragraphs, and authors after titles cannot be “clustered” • CRF: complexity beyond a linear-chain CRF grows too quickly • we want a “single-step” process, because multi-step algorithms erase information. Example: if the first step removes ads, the second step cannot use information about ads to infer content.
  20. First Attempt: Hypothesis Generation using Sampling • start from a guess (using a zeroth-order classifier) • generate variations of that guess with a proposal function • evaluate an objective function based on a document-wide likelihood built from classification probabilities
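The three sampling steps above can be sketched as a toy hill-climbing loop. Everything here is hypothetical: the toy probabilities, the single-flip proposal function, and the purely independent objective (a real document-wide objective would also score label interactions).

```python
# Toy sketch of the sampling approach: greedy per-element guess,
# single-label-flip proposals, document-wide log-likelihood objective.
import math
import random

LABELS = ["navigation", "title", "author"]

# made-up per-element classification probabilities P(label | element)
probs = [
    {"navigation": 0.7, "title": 0.2, "author": 0.1},
    {"navigation": 0.2, "title": 0.6, "author": 0.2},
    {"navigation": 0.3, "title": 0.3, "author": 0.4},
]

def log_likelihood(assignment):
    # objective: sum of per-element log-probabilities
    return sum(math.log(probs[i][label]) for i, label in enumerate(assignment))

def sample(n_steps=200, seed=0):
    rnd = random.Random(seed)
    # zeroth-order guess: argmax label per element
    best = [max(p, key=p.get) for p in probs]
    best_ll = log_likelihood(best)
    for _ in range(n_steps):
        proposal = list(best)
        i = rnd.randrange(len(proposal))   # proposal: flip one label
        proposal[i] = rnd.choice(LABELS)
        ll = log_likelihood(proposal)
        if ll > best_ll:                   # keep only improvements
            best, best_ll = proposal, ll
    return best

print(sample())  # ['navigation', 'title', 'author']
```

The slide's verdict applies even to this toy: many proposal evaluations per document make inference slow, which motivates the feed-forward second attempt.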
  21. First Attempt: Hypothesis Generation using Sampling • diagram: sampling over page-element label assignments (“navigation”, “title”, “author”)
  22. First Attempt: Hypothesis Generation using Sampling • decent results • training coverage questionable • slow inference
  23. Second Attempt: “Deep Learning Inspired” • borrow ideas from “scene description”, traditionally done with scene graphs and CRFs • with deep learning, one can avoid building a graph and go straight to assigning a label to every pixel (Clément Farabet, 2011)
  24. Second Attempt: “Deep Learning Inspired” • diagram: page elements pass through a first labeling stage (“navigation”, “title”, “author”)
  25. Second Attempt: “Deep Learning Inspired” • diagram: the first stage’s label outputs feed forward into a second labeling stage over the same page elements
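One way to read the two-stage diagrams above: a second classifier re-labels each element using the first stage's predicted probabilities, for the element and its neighbours, as extra features. The sketch below is my interpretation with synthetic data; the feature layout and neighbour window are assumptions, not the actual system.

```python
# Sketch of a feed-forward two-stage labeling pass: second-stage features
# include the first stage's predicted probabilities for each element and
# its immediate neighbours. Synthetic data; layout is illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(300, 8)             # base features per page element
y = rng.randint(0, 3, size=300)  # 0=navigation, 1=title, 2=author

first = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0)
first.fit(X, y)
P = first.predict_proba(X)       # first-stage label probabilities, (300, 3)

# neighbour context: previous and next element's predicted probabilities
prev_p = np.vstack([P[:1], P[:-1]])
next_p = np.vstack([P[1:], P[-1:]])
X2 = np.hstack([X, P, prev_p, next_p])   # 8 + 3 + 3 + 3 = 17 columns

second = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0)
second.fit(X2, y)
print(X2.shape)  # (300, 17)
```

Unlike the sampling loop, this is a fixed number of classifier evaluations per document, which is consistent with the order-of-magnitude speedup reported on the next slide.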
  26. Feed-forward process is much faster • processing time dropped by an order of magnitude, with no significant degradation in quality • training from URLs: ~2 hours; with cached external calls: <1 hour • introduced the Forward Model and the Bucket Model for load
  27. Business-visible Successes • embedded media content: Twitter cards, Instagram posts, Facebook posts, Facebook videos and YouTube videos • on the right, a New York Magazine article on the train crash in Philadelphia: amtrak-train-derails-philadelphia.html
  28. Preliminary Business-visible Successes • enabling domains that require JavaScript emulation (e.g. websites built purely with AngularJS) • fixed individual publishers with high visibility in our app • comparison to the competition: third party 71-82%, in-house 83% +/- 4%
  29. Summary • dataset creation, processing pipeline, content tree creation, evaluation tools, labeling tools, and training and inference strategies implemented over the past year • chose tools that allow quick iteration: simple parallel processing, ML on a single node • two open-source projects: pip install databench, pip install pysparkling • competitive performance: 54% of cards in Wildcard are powered by pure ML • @svenkreiss