[Figure: architecture diagram. Scraping Engine (Event Trigger, Property Reader, Proxy Claimer, Page Element Reader, Page Loader, Scraper, Data Processor) together with Packaging, Delivery, Engine Monitor, Notification, Engine Properties Configuration, Page Element Configuration, Proxy Collector, Notification Template, Scraping Case, and DB.]
Web Scraping Solution
- ScrapeXpress Classic Solution Series -
Author: Andy Yang
Auditor:
Created Date: 13/10/2015
Last Updated: 13/10/2015
Version: 1.3
Dependency Document: BusinessRequirementOfWebScrapingForxxxx_v1.1.docx
1. Overview
2. Business Modules
2.1. Core Module – Scraping Engine
Scraping Engine (also called Engine): one Engine per website.
The Engine executes a scraping task: it scrapes pre-defined data fields from a web page and converts the result data into a universal data object that other modules can use easily. It invokes the other relevant modules to finish the whole scraping task.
Components:
• Engine Event Trigger
Creates events and fires them to the Engine Monitor.
• Engine Properties Reader
Reads the engine properties defined by the Engine Properties Configuration.
• Proxy Claimer
Claims a proxy IP from the Proxy Pool maintained by the Proxy Collector.
• Page Elements Configuration Reader
Reads and analyses the configuration of page elements defined by the Page Element Configuration.
• Page Content Loader
Accepts the URL request, retrieves the web page content, and transforms it into a stream of HTML source code.
• Data Scraper
Parses the HTML content and extracts each data field from the stream of HTML source code by invoking a third-party API, then saves the data into the universal data object.
• Data Processor
Saves data into the database or forwards it to the Packaging Module, and sends notifications to the specified user through the Notification Module.
Output:
• Invoke the Packaging Module to package the result data into a file of the specified format.
• Invoke the Delivery Module to put the packaged file into the target folder.
• Fire exception events to the Engine Monitor, which handles them.
• Invoke the Notification Module to notify the specified user (Finance DEPT) of the status of the scraping task by email.
Input:
• Read the Engine Properties to control how the engine runs.
• Read the Web Page Structure Configuration to scrape the specified data from the web page content correctly.
• Dynamically claim a proxy IP and use it to access the target URL.
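The component flow described above can be sketched as a small pipeline. This is a minimal illustration, not the actual ScrapeXpress code: the interface names mirror the components listed in this section, but every signature is an assumption, `ScrapedRecord` is a hypothetical stand-in for the universal data object, and the regex-based `RegexScraper` only stands in for the third-party parsing API.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the "universal data object": field name -> scraped value.
final class ScrapedRecord {
    private final Map<String, String> fields = new LinkedHashMap<>();

    void put(String name, String value) { fields.put(name, value); }
    String get(String name) { return fields.get(name); }
}

// Page Content Loader: turns a URL into HTML source.
interface PageContentLoader { String load(String url); }

// Data Scraper: extracts the configured fields from the HTML.
interface DataScraper { ScrapedRecord scrape(String html); }

// Data Processor: saves or forwards the result.
interface DataProcessor { void process(ScrapedRecord record); }

// Engine Event Trigger: reports failures to the Engine Monitor.
interface EngineEventTrigger { void fire(String eventName, Throwable cause); }

final class ScrapingEngine {
    private final PageContentLoader loader;
    private final DataScraper scraper;
    private final DataProcessor processor;
    private final EngineEventTrigger trigger;

    ScrapingEngine(PageContentLoader loader, DataScraper scraper,
                   DataProcessor processor, EngineEventTrigger trigger) {
        this.loader = loader;
        this.scraper = scraper;
        this.processor = processor;
        this.trigger = trigger;
    }

    // Runs one scraping task; returns false after firing an exception event.
    boolean run(String url) {
        try {
            String html = loader.load(url);
            ScrapedRecord record = scraper.scrape(html);
            processor.process(record);
            return true;
        } catch (RuntimeException e) {
            trigger.fire("SCRAPE_FAILED", e); // event name is an assumption
            return false;
        }
    }
}

// Illustrative scraper: one regex with one capture group per field.
final class RegexScraper implements DataScraper {
    private final Map<String, Pattern> patterns;

    RegexScraper(Map<String, Pattern> patterns) { this.patterns = patterns; }

    @Override
    public ScrapedRecord scrape(String html) {
        ScrapedRecord record = new ScrapedRecord();
        for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
            Matcher m = entry.getValue().matcher(html);
            if (m.find()) {
                record.put(entry.getKey(), m.group(1));
            }
        }
        return record;
    }
}
```

Keeping each component behind its own interface is one way to let Stage 1 ship with simple implementations (direct page access, hard-coded configuration) and swap in the later-stage versions without touching the engine itself.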
2.2. Packaging Module
Accepts the result data from the Scraping Engine and converts it into a formatted file, such as Excel or CSV.
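As an illustration of the packaging step, the sketch below renders result rows as CSV text. The class name `CsvPackager` and its method signatures are hypothetical, and the quoting follows the common RFC 4180 convention rather than anything specified in this document. Excel output is not shown, since it would need a spreadsheet library.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class CsvPackager {
    // Quote a value when it contains a comma, a double quote, or a line break.
    static String escape(String value) {
        if (value == null) {
            return "";
        }
        boolean needsQuotes = value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r");
        String escaped = value.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Render a header row followed by one row per record, in column order.
    static String toCsv(List<String> columns, List<Map<String, String>> rows) {
        StringBuilder out = new StringBuilder();
        out.append(columns.stream().map(CsvPackager::escape)
                .collect(Collectors.joining(","))).append("\r\n");
        for (Map<String, String> row : rows) {
            out.append(columns.stream().map(c -> escape(row.get(c)))
                    .collect(Collectors.joining(","))).append("\r\n");
        }
        return out.toString();
    }
}
```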
2.3. Delivery Module
Accepts the delivery command from the Scraping Engine and puts the packaged file into the specified folder.
Tip: This module can be extended to deliver data files in other ways, such as by email.
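One possible way to keep the module extensible, as the tip suggests, is a small delivery interface with one implementation per channel. The names `DeliveryChannel` and `FolderDelivery` are assumptions for this sketch; only the folder-based delivery described above is implemented.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical extension point: one implementation per delivery channel.
interface DeliveryChannel {
    Path deliver(Path packagedFile);
}

// Delivers by copying the packaged file into a target folder.
final class FolderDelivery implements DeliveryChannel {
    private final Path targetFolder;

    FolderDelivery(Path targetFolder) {
        this.targetFolder = targetFolder;
    }

    @Override
    public Path deliver(Path packagedFile) {
        try {
            // Create the target folder if it does not exist yet.
            Files.createDirectories(targetFolder);
            Path target = targetFolder.resolve(packagedFile.getFileName());
            // Overwrite a file left over from a previous run.
            return Files.copy(packagedFile, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

An email channel could later implement the same interface without changes to the Scraping Engine.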
2.4. Engine Monitor
Accepts the exception events fired by the Scraping Engine, generates the message content from the message template, and invokes the Notification Module to send the message to ITD.
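The monitor's role can be sketched as follows. `EngineEvent`, `Notifier`, and the constructor parameters are hypothetical names chosen for illustration, and the template here uses plain `String.format` syntax as a placeholder for whatever template mechanism the system adopts.

```java
// Hypothetical event object fired by the Scraping Engine.
final class EngineEvent {
    final String engineName;
    final String type;
    final String detail;

    EngineEvent(String engineName, String type, String detail) {
        this.engineName = engineName;
        this.type = type;
        this.detail = detail;
    }
}

// Hypothetical entry point of the Notification Module.
interface Notifier {
    void send(String recipient, String message);
}

final class EngineMonitor {
    private final Notifier notifier;
    private final String recipient;       // for example an ITD mailbox (assumed)
    private final String messageTemplate; // String.format syntax with three %s slots

    EngineMonitor(Notifier notifier, String recipient, String messageTemplate) {
        this.notifier = notifier;
        this.recipient = recipient;
        this.messageTemplate = messageTemplate;
    }

    // Accept an event, build the message from the template, and forward it.
    void onEvent(EngineEvent event) {
        String message = String.format(messageTemplate,
                event.engineName, event.type, event.detail);
        notifier.send(recipient, message);
    }
}
```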
2.5. Notification Module
Accepts a message object and, using the pre-defined method, sends the message to the specified user by email.
3. Supporting Modules
3.1. Engine Properties Configurator
Defines the properties that the Scraping Engine uses when it executes a scraping task.
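A minimal sketch of reading such properties from a config file with `java.util.Properties` is shown below. The property keys (`engine.targetUrl`, `engine.timeoutSeconds`, `engine.useProxy`) and the default values are invented for illustration; the document does not define the actual property set.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Properties;

final class EngineProperties {
    private final Properties props;

    private EngineProperties(Properties props) {
        this.props = props;
    }

    // Load key=value pairs from any character source (file, string, ...).
    static EngineProperties load(Reader source) {
        Properties p = new Properties();
        try {
            p.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new EngineProperties(p);
    }

    // Mandatory property: fail fast when it is missing.
    String targetUrl() {
        String value = props.getProperty("engine.targetUrl");
        if (value == null || value.isBlank()) {
            throw new IllegalStateException("Missing property: engine.targetUrl");
        }
        return value.trim();
    }

    // Optional properties fall back to assumed defaults.
    int timeoutSeconds() {
        return Integer.parseInt(props.getProperty("engine.timeoutSeconds", "30"));
    }

    boolean useProxy() {
        return Boolean.parseBoolean(props.getProperty("engine.useProxy", "false"));
    }
}
```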
3.2. Page Element Configurator
Defines each data element to be scraped from the web page.
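To make the idea concrete, the sketch below models a page element as a field name plus a locator and parses a simple text configuration. The `field = selector` line format and the use of CSS-style selectors are assumptions; the document does not specify how page elements are described.

```java
import java.util.ArrayList;
import java.util.List;

// One data element to scrape: a field name plus a locator (assumed to be a CSS-style selector).
final class PageElement {
    final String fieldName;
    final String selector;

    PageElement(String fieldName, String selector) {
        this.fieldName = fieldName;
        this.selector = selector;
    }
}

final class PageElementConfig {
    // Parse lines of the form "fieldName = selector"; blank lines and '#' comments are skipped.
    static List<PageElement> parse(String text) {
        List<PageElement> elements = new ArrayList<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            // Split on the first '=' only, so selectors may contain '=' themselves.
            int eq = trimmed.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected 'field = selector': " + trimmed);
            }
            elements.add(new PageElement(
                    trimmed.substring(0, eq).trim(),
                    trimmed.substring(eq + 1).trim()));
        }
        return elements;
    }
}
```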
3.3. Proxy Collector
Collects free proxy server IPs from online websites, validates them, and submits the available ones to the Proxy Pool.
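The relationship between the Proxy Collector, the Proxy Pool, and the Proxy Claimer could look like the sketch below. All names are hypothetical, and the validation check is injected as a predicate so that the real connectivity test (not specified here) can be plugged in later.

```java
import java.util.Optional;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Predicate;

// Hypothetical value type for one proxy server.
final class ProxyAddress {
    final String host;
    final int port;

    ProxyAddress(String host, int port) {
        this.host = host;
        this.port = port;
    }
}

// Thread-safe pool shared by the Proxy Collector (producer) and the Proxy Claimer (consumer).
final class ProxyPool {
    private final Queue<ProxyAddress> available = new ConcurrentLinkedQueue<>();
    private final Predicate<ProxyAddress> validator;

    ProxyPool(Predicate<ProxyAddress> validator) {
        this.validator = validator;
    }

    // Collector side: only proxies that pass validation enter the pool.
    boolean submit(ProxyAddress proxy) {
        if (!validator.test(proxy)) {
            return false;
        }
        available.add(proxy);
        return true;
    }

    // Claimer side: take the next proxy, or empty when the pool is exhausted.
    Optional<ProxyAddress> claim() {
        return Optional.ofNullable(available.poll());
    }

    int size() {
        return available.size();
    }
}
```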
3.4. Notification Template Management
Creates and maintains the notification message templates, so that the content and format can be adjusted to the business scenario.
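A template kept outside the code can be as simple as text with named placeholders. The sketch below assumes a `${name}` placeholder syntax, which is an illustrative choice and not something this document prescribes.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NotificationTemplate {
    // Matches placeholders such as ${engine} or ${status}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\}");

    private final String text;

    NotificationTemplate(String text) {
        this.text = text;
    }

    // Replace each ${name} with its value; unknown names are left untouched.
    String render(Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = values.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            // quoteReplacement keeps '$' and '\' in values from being interpreted.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```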
4. Scraping Case System
4.1. Scraping Case Builder
• Define a Case.
• Put the Page Element Configuration into Java code.
• Initialise the data, such as search conditions and the running schedule.
4.2. Scraping Case Controller
• Start a case, or stop a running case.
• Log the status of the scraping process.
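The controller's start, stop, and status-logging duties might be sketched as below. The class shape, the `Status` values, and the in-memory log are assumptions for illustration; a real controller would more likely log to a file or table and follow the case's running schedule.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class ScrapingCaseController {
    enum Status { IDLE, RUNNING, STOPPED }

    private final Runnable scrapeOnce; // one pass of the scraping case
    private final long pauseMillis;    // pause between passes
    private final List<String> log = Collections.synchronizedList(new ArrayList<>());
    private volatile Status status = Status.IDLE;
    private Thread worker;

    ScrapingCaseController(Runnable scrapeOnce, long pauseMillis) {
        this.scrapeOnce = scrapeOnce;
        this.pauseMillis = pauseMillis;
    }

    // Start the case on a background thread; a second start while running is ignored.
    synchronized void start() {
        if (status == Status.RUNNING) {
            return;
        }
        status = Status.RUNNING;
        log.add("started");
        worker = new Thread(() -> {
            while (status == Status.RUNNING) {
                scrapeOnce.run();
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }, "scraping-case");
        worker.setDaemon(true);
        worker.start();
    }

    // Stop a running case and wake the worker if it is sleeping.
    synchronized void stop() {
        if (status != Status.RUNNING) {
            return;
        }
        status = Status.STOPPED;
        worker.interrupt();
        log.add("stopped");
    }

    Status status() {
        return status;
    }

    List<String> log() {
        return List.copyOf(log);
    }
}
```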
5. Implementation Strategy
We divide the project into 3 stages:
1. Stage 1: Basic Functions
2. Stage 2: Support & Advance Functions
3. Stage 3: High Level Functions
Stage 1: Basic Functions
Scope
Develop the essential modules and functions so that we can scrape data from the 3 websites mentioned in the requirement document, and defer some supporting modules and high level functions to stage 2 or stage 3.
Module / Functions & Comment

Scraping Engine
• Engine Properties Reader: Only write the properties into Java code instead of reading them from a config file. Reading properties from a config file will be developed in stage 2.
• Page Elements Configuration Reader: Only write the Page Element Configuration into Java code instead of reading it from a configuration file. Reading the Page Element Configuration from a config file will be developed in stage 3.
• Page Content Loader: Only access the target website directly instead of via a proxy server. Access via a proxy server will come in stage 2 or stage 3.
• Proxy Claimer: Only define the interface instead of claiming a proxy IP from the Proxy Pool. Claiming a proxy IP from the Proxy Pool will come in stage 2.
• Engine Event Trigger: Define the essential events to be fired. Depending on the requirements, we will add new events in stage 2 and stage 3.
• Data Scraper: Invoke a third-party API to scrape data from the web page according to the Page Element Configuration instead of developing a whole data scraping algorithm. We will rewrite the whole algorithm in stage 3.
• Data Processor: Directly save data into the database and package data into a formatted file.

Engine Monitor
• Able to handle the events fired by stage 2.

Packaging Module
• Package data into an Excel file and put it into the specified folder.

Notification Module
• Able to send essential notifications to ITD and Finance DEPT.

Scraping Case Builder
• Create scraping cases for the 3 websites and initialise the search conditions.
• Put the Page Element Configuration into Java code.

Scraping Case Controller
• Provide start and stop functions to run a scraping case and scrape data.
• No log functions.
Workload Assessment of Stage 1
Jobs / Workload (work days)

Preparation
1 Validate feasibility of technology
2 Prepare development environment and tools
3 Design and confirm data structure / definitions

Coding & Unit Testing
4 Program Scraping Engine
5 Program Engine Monitor
6 Program Packaging Module
6 Program Notification Module

Data Preparation
7 Collect search conditions from 2 websites: Divvy and Parkhound
7 Create and initialise Scraping Case
8 Put the Page Element Configuration into Java code (for 3 websites)

Testing and Deployment
9 Build testing environment and test
10 Build production environment and deploy system (only run as a standalone application)

Maintenance and Document
11 On-site maintenance and bug fixing
12 Write usage instructions (not a technical document)
Stage 2: Support and Advance Functions
Module / Functions & Comment

Scraping Engine
• Read properties from a config file.
• Access the target website via a proxy server.
• Claim a proxy IP from the Proxy Pool.

Engine Monitor
• Update depending on real requirements.

Notification Module
• Update depending on real requirements.

Scraping Case Controller
• Update depending on real requirements.

Engine Properties Configurator
• Maintain the properties of the engine in a config file or table.

Proxy Collector
• Manually write the proxy list into the Proxy Pool.

Scraping Case Controller
• Log the status of the scraping process.
Stage 3: High Level Functions
Module / Functions & Comment

Scraping Engine
• Read the Page Element Configuration from a config file or table.
• Rewrite the whole data scraping algorithm based on the Page Element Configuration.

Page Element Configurator

Proxy Collector
• Automatically collect proxy IPs from internet websites and validate their connectivity.

Notification Template Management
• Define the message templates outside the system instead of writing message content in Java code.

More Related Content

What's hot

WebLogic FAQs
WebLogic FAQsWebLogic FAQs
WebLogic FAQs
Amit Sharma
 
Google App Engine
Google App EngineGoogle App Engine
Google App Engine
Anil Saldanha
 
Apache course contents
Apache course contentsApache course contents
Apache course contents
darshangosh
 
Unit5 servlets
Unit5 servletsUnit5 servlets
Unit5 servlets
Praveen Yadav
 
06 asp.net session08
06 asp.net session0806 asp.net session08
06 asp.net session08
Niit Care
 
Architecture In Share Point2010
Architecture In Share Point2010Architecture In Share Point2010
Architecture In Share Point2010
Alexander Meijers
 
Share point review qustions
Share point review qustionsShare point review qustions
Share point review qustions
than sare
 
Oracle Weblogic Server 11g: System Administration I
Oracle Weblogic Server 11g: System Administration IOracle Weblogic Server 11g: System Administration I
Oracle Weblogic Server 11g: System Administration I
Sachin Kumar
 
Web servers
Web serversWeb servers
Web servers
Kuldeep Kulkarni
 
SUG Bangalore - Kick Off Session
SUG Bangalore - Kick Off SessionSUG Bangalore - Kick Off Session
SUG Bangalore - Kick Off Session
Anindita Bhattacharya
 
Linux VMWare image with Informatica , Oracle and Rundeck scheduler
Linux VMWare image with Informatica , Oracle and Rundeck schedulerLinux VMWare image with Informatica , Oracle and Rundeck scheduler
Linux VMWare image with Informatica , Oracle and Rundeck scheduler
pcherukumalla
 
Websphere interview Questions
Websphere interview QuestionsWebsphere interview Questions
Websphere interview Questions
gummadi1
 
Spring review_for Semester II of Year 4
Spring review_for Semester II of Year 4Spring review_for Semester II of Year 4
Spring review_for Semester II of Year 4
than sare
 
State management
State managementState management
State management
Iblesoft
 
WebLogic for DBAs
WebLogic for DBAsWebLogic for DBAs
WebLogic for DBAs
Simon Haslam
 
Sharepoint Performance - part 2
Sharepoint Performance - part 2Sharepoint Performance - part 2
Sharepoint Performance - part 2
Regroove
 
Weblogic 11g admin basic with screencast
Weblogic 11g admin basic with screencastWeblogic 11g admin basic with screencast
Weblogic 11g admin basic with screencast
Rajiv Gupta
 
Personalization in webcenter portal
Personalization in webcenter portalPersonalization in webcenter portal
Personalization in webcenter portal
Vinay Kumar
 
Understanding iis part2
Understanding iis part2Understanding iis part2
Understanding iis part2
Om Vikram Thapa
 
introduction and configuration of IIS (in addition with printer)
introduction and configuration of IIS (in addition with printer)introduction and configuration of IIS (in addition with printer)
introduction and configuration of IIS (in addition with printer)
Assay Khan
 

What's hot (20)

WebLogic FAQs
WebLogic FAQsWebLogic FAQs
WebLogic FAQs
 
Google App Engine
Google App EngineGoogle App Engine
Google App Engine
 
Apache course contents
Apache course contentsApache course contents
Apache course contents
 
Unit5 servlets
Unit5 servletsUnit5 servlets
Unit5 servlets
 
06 asp.net session08
06 asp.net session0806 asp.net session08
06 asp.net session08
 
Architecture In Share Point2010
Architecture In Share Point2010Architecture In Share Point2010
Architecture In Share Point2010
 
Share point review qustions
Share point review qustionsShare point review qustions
Share point review qustions
 
Oracle Weblogic Server 11g: System Administration I
Oracle Weblogic Server 11g: System Administration IOracle Weblogic Server 11g: System Administration I
Oracle Weblogic Server 11g: System Administration I
 
Web servers
Web serversWeb servers
Web servers
 
SUG Bangalore - Kick Off Session
SUG Bangalore - Kick Off SessionSUG Bangalore - Kick Off Session
SUG Bangalore - Kick Off Session
 
Linux VMWare image with Informatica , Oracle and Rundeck scheduler
Linux VMWare image with Informatica , Oracle and Rundeck schedulerLinux VMWare image with Informatica , Oracle and Rundeck scheduler
Linux VMWare image with Informatica , Oracle and Rundeck scheduler
 
Websphere interview Questions
Websphere interview QuestionsWebsphere interview Questions
Websphere interview Questions
 
Spring review_for Semester II of Year 4
Spring review_for Semester II of Year 4Spring review_for Semester II of Year 4
Spring review_for Semester II of Year 4
 
State management
State managementState management
State management
 
WebLogic for DBAs
WebLogic for DBAsWebLogic for DBAs
WebLogic for DBAs
 
Sharepoint Performance - part 2
Sharepoint Performance - part 2Sharepoint Performance - part 2
Sharepoint Performance - part 2
 
Weblogic 11g admin basic with screencast
Weblogic 11g admin basic with screencastWeblogic 11g admin basic with screencast
Weblogic 11g admin basic with screencast
 
Personalization in webcenter portal
Personalization in webcenter portalPersonalization in webcenter portal
Personalization in webcenter portal
 
Understanding iis part2
Understanding iis part2Understanding iis part2
Understanding iis part2
 
introduction and configuration of IIS (in addition with printer)
introduction and configuration of IIS (in addition with printer)introduction and configuration of IIS (in addition with printer)
introduction and configuration of IIS (in addition with printer)
 

Viewers also liked

Gsc2015 봄 09 강민정-카이스트sk사회적기업가센터-소개
Gsc2015 봄 09 강민정-카이스트sk사회적기업가센터-소개Gsc2015 봄 09 강민정-카이스트sk사회적기업가센터-소개
Gsc2015 봄 09 강민정-카이스트sk사회적기업가센터-소개
Shougo Kim
 
공동주택 지역기반 커뮤니티 SNS 사업계획서
공동주택 지역기반 커뮤니티 SNS 사업계획서공동주택 지역기반 커뮤니티 SNS 사업계획서
공동주택 지역기반 커뮤니티 SNS 사업계획서
Seongwon Eun
 
민영미디어렙 도입논의 토론회_091104
민영미디어렙 도입논의 토론회_091104민영미디어렙 도입논의 토론회_091104
민영미디어렙 도입논의 토론회_091104
Borah Kang
 
소셜미디어 동영상 콘텐츠 노하우_2014 블로터닷넷 콘퍼런스
소셜미디어 동영상 콘텐츠 노하우_2014 블로터닷넷 콘퍼런스소셜미디어 동영상 콘텐츠 노하우_2014 블로터닷넷 콘퍼런스
소셜미디어 동영상 콘텐츠 노하우_2014 블로터닷넷 콘퍼런스
Dongjae Lee
 
2015 smr리포트 4차_150428_f
2015 smr리포트 4차_150428_f2015 smr리포트 4차_150428_f
2015 smr리포트 4차_150428_f
Jay Park
 
모바일/온라인 게임의 매출시뮬레이션
모바일/온라인 게임의 매출시뮬레이션모바일/온라인 게임의 매출시뮬레이션
모바일/온라인 게임의 매출시뮬레이션
Sunnyrider
 
2015년 글로벌 광고시장 트렌드
2015년 글로벌 광고시장 트렌드2015년 글로벌 광고시장 트렌드
2015년 글로벌 광고시장 트렌드
MezzoMedia
 
BUZZscape 2.0 - MMC 2016 발표자료 "한국 모바일 광고 생태계 어떻게 변화하고 있는가"
BUZZscape 2.0 - MMC 2016 발표자료 "한국 모바일 광고 생태계 어떻게 변화하고 있는가"BUZZscape 2.0 - MMC 2016 발표자료 "한국 모바일 광고 생태계 어떻게 변화하고 있는가"
BUZZscape 2.0 - MMC 2016 발표자료 "한국 모바일 광고 생태계 어떻게 변화하고 있는가"
Buzzvil
 
2016년 미디어 전망(f) 201512
2016년 미디어 전망(f) 2015122016년 미디어 전망(f) 201512
2016년 미디어 전망(f) 201512
Nasmedia
 
MezzoMedia - Media & Market Report [12월 호]
MezzoMedia - Media & Market Report [12월 호]MezzoMedia - Media & Market Report [12월 호]
MezzoMedia - Media & Market Report [12월 호]
MezzoMedia
 
2017전망보고서 미디어이슈 1215
2017전망보고서 미디어이슈 12152017전망보고서 미디어이슈 1215
2017전망보고서 미디어이슈 1215
Nasmedia
 
[메조미디어] 2017년 미디어트렌드리포트
[메조미디어] 2017년 미디어트렌드리포트[메조미디어] 2017년 미디어트렌드리포트
[메조미디어] 2017년 미디어트렌드리포트
MezzoMedia
 

Viewers also liked (12)

Gsc2015 봄 09 강민정-카이스트sk사회적기업가센터-소개
Gsc2015 봄 09 강민정-카이스트sk사회적기업가센터-소개Gsc2015 봄 09 강민정-카이스트sk사회적기업가센터-소개
Gsc2015 봄 09 강민정-카이스트sk사회적기업가센터-소개
 
공동주택 지역기반 커뮤니티 SNS 사업계획서
공동주택 지역기반 커뮤니티 SNS 사업계획서공동주택 지역기반 커뮤니티 SNS 사업계획서
공동주택 지역기반 커뮤니티 SNS 사업계획서
 
민영미디어렙 도입논의 토론회_091104
민영미디어렙 도입논의 토론회_091104민영미디어렙 도입논의 토론회_091104
민영미디어렙 도입논의 토론회_091104
 
소셜미디어 동영상 콘텐츠 노하우_2014 블로터닷넷 콘퍼런스
소셜미디어 동영상 콘텐츠 노하우_2014 블로터닷넷 콘퍼런스소셜미디어 동영상 콘텐츠 노하우_2014 블로터닷넷 콘퍼런스
소셜미디어 동영상 콘텐츠 노하우_2014 블로터닷넷 콘퍼런스
 
2015 smr리포트 4차_150428_f
2015 smr리포트 4차_150428_f2015 smr리포트 4차_150428_f
2015 smr리포트 4차_150428_f
 
모바일/온라인 게임의 매출시뮬레이션
모바일/온라인 게임의 매출시뮬레이션모바일/온라인 게임의 매출시뮬레이션
모바일/온라인 게임의 매출시뮬레이션
 
2015년 글로벌 광고시장 트렌드
2015년 글로벌 광고시장 트렌드2015년 글로벌 광고시장 트렌드
2015년 글로벌 광고시장 트렌드
 
BUZZscape 2.0 - MMC 2016 발표자료 "한국 모바일 광고 생태계 어떻게 변화하고 있는가"
BUZZscape 2.0 - MMC 2016 발표자료 "한국 모바일 광고 생태계 어떻게 변화하고 있는가"BUZZscape 2.0 - MMC 2016 발표자료 "한국 모바일 광고 생태계 어떻게 변화하고 있는가"
BUZZscape 2.0 - MMC 2016 발표자료 "한국 모바일 광고 생태계 어떻게 변화하고 있는가"
 
2016년 미디어 전망(f) 201512
2016년 미디어 전망(f) 2015122016년 미디어 전망(f) 201512
2016년 미디어 전망(f) 201512
 
MezzoMedia - Media & Market Report [12월 호]
MezzoMedia - Media & Market Report [12월 호]MezzoMedia - Media & Market Report [12월 호]
MezzoMedia - Media & Market Report [12월 호]
 
2017전망보고서 미디어이슈 1215
2017전망보고서 미디어이슈 12152017전망보고서 미디어이슈 1215
2017전망보고서 미디어이슈 1215
 
[메조미디어] 2017년 미디어트렌드리포트
[메조미디어] 2017년 미디어트렌드리포트[메조미디어] 2017년 미디어트렌드리포트
[메조미디어] 2017년 미디어트렌드리포트
 

Similar to ScrapeXpress-Standalone-solution

Parallelminds.asp.net with sp
Parallelminds.asp.net with spParallelminds.asp.net with sp
Parallelminds.asp.net with sp
parallelminder
 
06 asp.net session08
06 asp.net session0806 asp.net session08
06 asp.net session08
Vivek chan
 
Meteor Meet-up San Diego December 2014
Meteor Meet-up San Diego December 2014Meteor Meet-up San Diego December 2014
Meteor Meet-up San Diego December 2014
Lou Sacco
 
DEVICE CHANNELS
DEVICE CHANNELSDEVICE CHANNELS
DEVICE CHANNELS
Assaf Biton
 
Java EE Services
Java EE ServicesJava EE Services
Java EE Services
Abdalla Mahmoud
 
oVirt UI Plugin Infrastructure and the oVirt-Foreman plugin
oVirt UI Plugin Infrastructure and the oVirt-Foreman pluginoVirt UI Plugin Infrastructure and the oVirt-Foreman plugin
oVirt UI Plugin Infrastructure and the oVirt-Foreman plugin
Oved Ourfali
 
06 asp.net session08
06 asp.net session0806 asp.net session08
06 asp.net session08
Mani Chaubey
 
Synopsis
SynopsisSynopsis
Asp.net control
Asp.net controlAsp.net control
Asp.net control
Paneliya Prince
 
KMS (1)
KMS (1)KMS (1)
KMS (1)
Satyaki Mitra
 
ASP.NET Lecture 5
ASP.NET Lecture 5ASP.NET Lecture 5
ASP.NET Lecture 5
Julie Iskander
 
UI5con 2017 - UI5 Components - More Performance...
UI5con 2017 - UI5 Components - More Performance...UI5con 2017 - UI5 Components - More Performance...
UI5con 2017 - UI5 Components - More Performance...
Peter Muessig
 
ASP.NET Lecture 2
ASP.NET Lecture 2ASP.NET Lecture 2
ASP.NET Lecture 2
Julie Iskander
 
Server side rendering review
Server side rendering reviewServer side rendering review
Server side rendering review
Vladyslav Morzhanov
 
2310 b 15
2310 b 152310 b 15
2310 b 15
Krazy Koder
 
2310 b 15
2310 b 152310 b 15
2310 b 15
Krazy Koder
 
05 asp.net session07
05 asp.net session0705 asp.net session07
05 asp.net session07
Vivek chan
 
Power of ONE Automation through Web Services
Power of ONE Automation through Web ServicesPower of ONE Automation through Web Services
Power of ONE Automation through Web Services
CA | Automic Software
 
Parallelminds.web partdemo1
Parallelminds.web partdemo1Parallelminds.web partdemo1
Parallelminds.web partdemo1
parallelminder
 
Web components - An Introduction
Web components - An IntroductionWeb components - An Introduction
Web components - An Introduction
cherukumilli2
 

Similar to ScrapeXpress-Standalone-solution (20)

Parallelminds.asp.net with sp
Parallelminds.asp.net with spParallelminds.asp.net with sp
Parallelminds.asp.net with sp
 
06 asp.net session08
06 asp.net session0806 asp.net session08
06 asp.net session08
 
Meteor Meet-up San Diego December 2014
Meteor Meet-up San Diego December 2014Meteor Meet-up San Diego December 2014
Meteor Meet-up San Diego December 2014
 
DEVICE CHANNELS
DEVICE CHANNELSDEVICE CHANNELS
DEVICE CHANNELS
 
Java EE Services
Java EE ServicesJava EE Services
Java EE Services
 
oVirt UI Plugin Infrastructure and the oVirt-Foreman plugin
oVirt UI Plugin Infrastructure and the oVirt-Foreman pluginoVirt UI Plugin Infrastructure and the oVirt-Foreman plugin
oVirt UI Plugin Infrastructure and the oVirt-Foreman plugin
 
06 asp.net session08
06 asp.net session0806 asp.net session08
06 asp.net session08
 
Synopsis
SynopsisSynopsis
Synopsis
 
Asp.net control
Asp.net controlAsp.net control
Asp.net control
 
KMS (1)
KMS (1)KMS (1)
KMS (1)
 
ASP.NET Lecture 5
ASP.NET Lecture 5ASP.NET Lecture 5
ASP.NET Lecture 5
 
UI5con 2017 - UI5 Components - More Performance...
UI5con 2017 - UI5 Components - More Performance...UI5con 2017 - UI5 Components - More Performance...
UI5con 2017 - UI5 Components - More Performance...
 
ASP.NET Lecture 2
ASP.NET Lecture 2ASP.NET Lecture 2
ASP.NET Lecture 2
 
Server side rendering review
Server side rendering reviewServer side rendering review
Server side rendering review
 
2310 b 15
2310 b 152310 b 15
2310 b 15
 
2310 b 15
2310 b 152310 b 15
2310 b 15
 
05 asp.net session07
05 asp.net session0705 asp.net session07
05 asp.net session07
 
Power of ONE Automation through Web Services
Power of ONE Automation through Web ServicesPower of ONE Automation through Web Services
Power of ONE Automation through Web Services
 
Parallelminds.web partdemo1
Parallelminds.web partdemo1Parallelminds.web partdemo1
Parallelminds.web partdemo1
 
Web components - An Introduction
Web components - An IntroductionWeb components - An Introduction
Web components - An Introduction
 

More from Andy Yang

Jxt job posting
Jxt job postingJxt job posting
Jxt job posting
Andy Yang
 
Sae job application export
Sae job application exportSae job application export
Sae job application export
Andy Yang
 
Integration solution with daxtra resume indexing
Integration solution with daxtra resume indexingIntegration solution with daxtra resume indexing
Integration solution with daxtra resume indexing
Andy Yang
 
Ctc people product development and release process
Ctc people product development and release processCtc people product development and release process
Ctc people product development and release process
Andy Yang
 
One push architecture plugin work with hub & board core
One push architecture   plugin work with hub & board coreOne push architecture   plugin work with hub & board core
One push architecture plugin work with hub & board core
Andy Yang
 
One push architecture plugin and container
One push architecture   plugin and containerOne push architecture   plugin and container
One push architecture plugin and container
Andy Yang
 
One push architecture total architecture
One push architecture   total architectureOne push architecture   total architecture
One push architecture total architecture
Andy Yang
 
Onepush platformtotalsolution
Onepush platformtotalsolutionOnepush platformtotalsolution
Onepush platformtotalsolution
Andy Yang
 
eDM system model
eDM system modeleDM system model
eDM system model
Andy Yang
 
eDM infrastructure
eDM infrastructureeDM infrastructure
eDM infrastructure
Andy Yang
 

More from Andy Yang (10)

Jxt job posting
Jxt job postingJxt job posting
Jxt job posting
 
Sae job application export
Sae job application exportSae job application export
Sae job application export
 
Integration solution with daxtra resume indexing
Integration solution with daxtra resume indexingIntegration solution with daxtra resume indexing
Integration solution with daxtra resume indexing
 
Ctc people product development and release process
Ctc people product development and release processCtc people product development and release process
Ctc people product development and release process
 
One push architecture plugin work with hub & board core
One push architecture   plugin work with hub & board coreOne push architecture   plugin work with hub & board core
One push architecture plugin work with hub & board core
 
One push architecture plugin and container
One push architecture   plugin and containerOne push architecture   plugin and container
One push architecture plugin and container
 
One push architecture total architecture
One push architecture   total architectureOne push architecture   total architecture
One push architecture total architecture
 
Onepush platformtotalsolution
Onepush platformtotalsolutionOnepush platformtotalsolution
Onepush platformtotalsolution
 
eDM system model
eDM system modeleDM system model
eDM system model
 
eDM infrastructure
eDM infrastructureeDM infrastructure
eDM infrastructure
 

Recently uploaded

UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsUI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
Peter Muessig
 
Malibou Pitch Deck For Its €3M Seed Round
Malibou Pitch Deck For Its €3M Seed RoundMalibou Pitch Deck For Its €3M Seed Round
Malibou Pitch Deck For Its €3M Seed Round
sjcobrien
 
E-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet DynamicsE-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet Dynamics
Hornet Dynamics
 
Artificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension FunctionsArtificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension Functions
Octavian Nadolu
 
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemUI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
Peter Muessig
 
fiscal year variant fiscal year variant.
fiscal year variant fiscal year variant.fiscal year variant fiscal year variant.
fiscal year variant fiscal year variant.
AnkitaPandya11
 
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
mz5nrf0n
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
Sven Peters
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
mz5nrf0n
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
dakas1
 
14 th Edition of International conference on computer vision
14 th Edition of International conference on computer vision14 th Edition of International conference on computer vision
14 th Edition of International conference on computer vision
ShulagnaSarkar2
 
UI5con 2024 - Bring Your Own Design System
UI5con 2024 - Bring Your Own Design SystemUI5con 2024 - Bring Your Own Design System
UI5con 2024 - Bring Your Own Design System
Peter Muessig
 
All you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVMAll you need to know about Spring Boot and GraalVM
All you need to know about Spring Boot and GraalVM
Alina Yurenko
 
What next after learning python programming basics
What next after learning python programming basicsWhat next after learning python programming basics
What next after learning python programming basics
Rakesh Kumar R
 
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf8 Best Automated Android App Testing Tool and Framework in 2024.pdf
8 Best Automated Android App Testing Tool and Framework in 2024.pdf
kalichargn70th171
 
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
Everything You Need to Know About X-Sign: The eSign Functionality of XfilesPr...
XfilesPro
 
WWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders AustinWWDC 2024 Keynote Review: For CocoaCoders Austin
WWDC 2024 Keynote Review: For CocoaCoders Austin
Patrick Weigel
 
Fundamentals of Programming and Language Processors
Fundamentals of Programming and Language ProcessorsFundamentals of Programming and Language Processors
Fundamentals of Programming and Language Processors
Rakesh Kumar R
 
How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?How Can Hiring A Mobile App Development Company Help Your Business Grow?
How Can Hiring A Mobile App Development Company Help Your Business Grow?
ToXSL Technologies
 
Using Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query PerformanceUsing Query Store in Azure PostgreSQL to Understand Query Performance
Using Query Store in Azure PostgreSQL to Understand Query Performance
Grant Fritchey
 

Recently uploaded (20)

UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling ExtensionsUI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
UI5con 2024 - Boost Your Development Experience with UI5 Tooling Extensions
 
Malibou Pitch Deck For Its €3M Seed Round
Malibou Pitch Deck For Its €3M Seed RoundMalibou Pitch Deck For Its €3M Seed Round
Malibou Pitch Deck For Its €3M Seed Round
 
E-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet DynamicsE-commerce Development Services- Hornet Dynamics
E-commerce Development Services- Hornet Dynamics
 
Artificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension FunctionsArtificia Intellicence and XPath Extension Functions
Artificia Intellicence and XPath Extension Functions
 
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s EcosystemUI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
UI5con 2024 - Keynote: Latest News about UI5 and it’s Ecosystem
 
fiscal year variant fiscal year variant.
fiscal year variant fiscal year variant.fiscal year variant fiscal year variant.
fiscal year variant fiscal year variant.
 
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
在线购买加拿大英属哥伦比亚大学毕业证本科学位证书原版一模一样
 
Microservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we workMicroservice Teams - How the cloud changes the way we work
Microservice Teams - How the cloud changes the way we work
 
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
原版定制美国纽约州立大学奥尔巴尼分校毕业证学位证书原版一模一样
 
一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理一比一原版(USF毕业证)旧金山大学毕业证如何办理
一比一原版(USF毕业证)旧金山大学毕业证如何办理
 
14 th Edition of International conference on computer vision
14 th Edition of International conference on computer vision14 th Edition of International conference on computer vision

ScrapeXpress-Standalone-solution

  • 1. [Architecture diagram. Labels: Scraping Engine (Event Trigger, Property Reader, Proxy Claimer, Page Element Reader, Page Loader, Scraper, Data Processor), Packaging, Delivery, Engine Monitor, Notification, Engine Properties Configuration, Page Element Configuration, Proxy Collector, Notification Template, Scraping Case, DB]
    Web Scraping Solution
    - ScrapeXpress Classic Solution Series -
    Author: Andy Yang
    Auditor:
    Created Date: 13/10/2015
    Last Updated: 13/10/2015
    Version: 1.3
    Dependency Document: BusinessRequirementOfWebScrapingForxxxx_v1.1.docx
    1. Overview
  • 2. 2. Business Modules
    2.1. Core Module – Scraping Engine
    Scraping Engine (also called Engine) - one Engine per website
    Executes a scraping task: scrapes pre-defined data fields from a web page and converts the result data into a universal data object that other modules can use easily. The Engine invokes the other relevant modules to finish the whole scraping task.
    Components:
    - Engine Event Trigger: creates an event and fires it to the Engine Monitor.
    - Engine Properties Reader: reads the engine properties defined by the Engine Properties Configuration.
    - Proxy Claimer: claims a proxy IP from the Proxy Pool maintained by the Proxy Collector.
    - Page Elements Configuration Reader: reads and analyses the page element configuration defined by the Page Element Configuration.
    - Page Content Loader: accepts the URL request, retrieves the web page content and transforms it into a stream of HTML source code.
    - Data Scraper: parses the HTML content and extracts each data field from the stream of HTML source code by invoking a third-party API, then saves the data into the universal data object.
    - Data Processor: saves the data into the database or forwards it to the Packaging Module, and sends a notification to the specified user through the Notification Module.
    Output
    - Invoke the Packaging Module to package the result data into a file of the specified format.
    - Invoke the Delivery Module to put the packaged file into the target folder.
    - Fire exception events to the Engine Monitor, which handles them.
    - Invoke the Notification Module to notify the specific user (Finance DEPT) of the status of the scraping task by email.
  • 3. Input
    - Read the Engine Properties to control how the engine runs.
    - Read the Web Page Structure Configuration to scrape the specified data from the web page content correctly.
    - Dynamically claim a proxy IP and use it to access the target URL.
    2.2. Packaging Module
    Accepts the result data from the Scraping Engine and converts it into a formatted file, such as Excel or CSV.
    2.3. Delivery Module
    Accepts the delivery command from the Scraping Engine and puts the packaged file into the specified folder.
    Tips: This module can be extended to deliver the data file in other ways, such as by email.
    2.4. Engine Monitor
    Accepts the exception events fired by the Scraping Engine, generates the message content from the message template and invokes the Notification Module to send the message to ITD.
    2.5. Notification Module
    Accepts a message object and sends the message to the specified user by email, according to the pre-defined method.
    3. Supporting Modules
    3.1. Engine Properties Configurator
    Defines the properties that the Scraping Engine uses when it executes a scraping task.
    3.2. Page Element Configurator
    Defines each data element that you want to scrape from the web page.
    3.3. Proxy Collector
    Collects free proxy server IPs from online websites, validates them and submits the available proxy IPs into the Proxy Pool.
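The hand-off between the Proxy Collector (3.3) and the Proxy Claimer could look roughly like the Java sketch below. It is only an illustration under stated assumptions: the names `ProxyPool`, `ProxyClaimer`, `submit` and `claim` are hypothetical, the pool lives in memory, the collector is assumed to have validated a proxy before submitting it, and falling back to a direct connection when the pool is empty is a choice made here, not something the document specifies.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical in-memory Proxy Pool: the Proxy Collector submits proxies it
// has already validated, and the Scraping Engine claims them one at a time.
final class ProxyPool implements ProxyClaimer {
    private final Deque<InetSocketAddress> available = new ArrayDeque<>();

    // Called by the collector side after validation (validation itself is not shown).
    synchronized void submit(String host, int port) {
        available.addLast(InetSocketAddress.createUnresolved(host, port));
    }

    // Hands out the oldest proxy in the pool. Assumption: when the pool is
    // empty, the engine connects directly instead of failing.
    @Override
    public synchronized Proxy claim() {
        InetSocketAddress next = available.pollFirst();
        return next == null ? Proxy.NO_PROXY : new Proxy(Proxy.Type.HTTP, next);
    }

    public static void main(String[] args) {
        ProxyPool pool = new ProxyPool();
        pool.submit("proxy.example", 8080); // placeholder host, not a real proxy
        System.out.println("claimed: " + pool.claim());
        System.out.println("claimed from empty pool: " + pool.claim());
    }
}

// Hypothetical contract for the Proxy Claimer component of the engine.
interface ProxyClaimer {
    Proxy claim();
}
```

Keeping the claimer behind an interface lets the engine be written against `ProxyClaimer` only, so the pool implementation can change without touching the Page Content Loader.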
  • 4. 3.4. Notification Template Management
    Creates and maintains the templates of notification messages, so that the content and format can be adjusted to the business scenario.
    4. Scraping Case System
    4.1. Scraping Case Builder
    - Define a case.
    - Put the Page Element Configuration into Java code.
    - Initialise the data, such as search conditions, running schedule...
    4.2. Scraping Case Controller
    - Start (run) a case, or stop a running case.
    - Log the status of the scraping process.
    5. Implementation Strategy
    We separate the whole project into 3 stages:
    1. Stage 1: Basic Functions
    2. Stage 2: Support & Advance Functions
    3. Stage 3: High Level Functions
    Stage 1: Basic Functions
    Scope
    Develop the essential modules and functions so that we can scrape data from the 3 websites mentioned in the requirement document; some supporting modules and high level functions are deferred to stage 2 or stage 3.
    Module: Functions & Comment
    Scraping Engine:
    - Engine Properties Reader: only writes the properties into Java code instead of reading them from a config file. Reading properties from a config file will be developed in stage 2.
    - Page Elements Configuration Reader: only writes the Page Element Configuration into Java code instead of reading it from a configuration file.
  • 5. Reading the Page Element Configuration from a config file will be developed in stage 3.
    - Page Content Loader: only accesses the target website directly instead of via a proxy server. Access via a proxy server will be developed in stage 2 or stage 3.
    - Proxy Claimer: only defines the interface instead of claiming a proxy IP from the Proxy Pool. Claiming a proxy IP from the Proxy Pool will be developed in stage 2.
    - Engine Event Trigger: defines the essential events to be fired. New events will be added in stage 2 and stage 3 according to the requirements.
    - Data Scraper: invokes a third-party API to scrape data from the web page according to the Page Element Configuration instead of implementing a complete data scraping algorithm. The whole algorithm will be rewritten in stage 3.
    - Data Processor: directly saves data into the database and packages data into a formatted file.
    Engine Monitor: able to handle these events fired by stage 2.
    Packaging Module: packages data into an Excel file and puts it into the specified folder.
    Notification Module: able to send essential notifications to ITD and Finance DEPT.
    Scraping Case Builder: creates scraping cases for the 3 websites and initialises the search conditions; puts the Page Element Configuration into Java code.
    Scraping Case Controller: provides start and stop functions to run a scraping case and scrape data; no log functions.
    Workload Assessment of Stage 1
    Jobs: Work load (work day)
    Preparation
    1 Validate feasibility of technology
    2 Prepare development environment and tools
    3 Design and confirm data structure / definitions
  • 6. Coding & Unit Testing
    4 Program Scraping Engine
    5 Program Engine Monitor
    6 Program Packaging Module
    6 Program Notification Module
    Data Preparation
    7 Collect search conditions from 2 websites: Divvy and Parkhound
    7 Create and initialise Scraping Case
    8 Put the Page Element Configuration into Java code (for 3 websites)
    Testing and Deployment
    9 Build testing environment and test
    10 Build production environment and deploy system (only runs as a standalone application)
    Maintenance and Document
    11 On-site maintenance and bug fixing
    12 Write usage instructions (not a technical document)
    Stage 2: Support and Advance Functions
    Module: Functions & Comment
    Scraping Engine:
    - Read properties from a config file
    - Access the target website via a proxy server
    - Claim a proxy IP from the Proxy Pool
    Engine Monitor: update depending on real requirements
    Notification Module: update depending on real requirements
    Scraping Case Controller: update depending on real requirements
    Engine Properties Configurator: maintain the engine properties in a config file or table
    Proxy Collector: manually write the proxy list into the Proxy Pool
    Scraping Case Controller: log the status of the scraping process
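For the stage 2 step of reading engine properties from a config file, a minimal Java sketch using the standard `java.util.Properties` class might look like this. The property keys (`engine.targetUrl`, `engine.timeoutMillis`, `engine.useProxy`), the default values and the `EngineProperties` class are all assumptions made for illustration; the document does not define the actual property set.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical holder for the engine properties; keys and defaults are illustrative.
final class EngineProperties {
    final String targetUrl;
    final int timeoutMillis;
    final boolean useProxy;

    private EngineProperties(String targetUrl, int timeoutMillis, boolean useProxy) {
        this.targetUrl = targetUrl;
        this.timeoutMillis = timeoutMillis;
        this.useProxy = useProxy;
    }

    // Reads properties from any Reader (a FileReader in practice). Values that
    // are absent from the source fall back to the defaults set below.
    static EngineProperties load(Reader source) throws IOException {
        Properties defaults = new Properties();
        defaults.setProperty("engine.timeoutMillis", "10000");
        defaults.setProperty("engine.useProxy", "false");

        Properties props = new Properties(defaults);
        props.load(source);

        String url = props.getProperty("engine.targetUrl");
        if (url == null || url.trim().isEmpty()) {
            throw new IllegalArgumentException("engine.targetUrl is required");
        }
        int timeout = Integer.parseInt(props.getProperty("engine.timeoutMillis").trim());
        boolean proxy = Boolean.parseBoolean(props.getProperty("engine.useProxy").trim());
        return new EngineProperties(url.trim(), timeout, proxy);
    }

    public static void main(String[] args) throws IOException {
        // Inline config text stands in for the contents of a config file.
        String config = "engine.targetUrl=https://example.com/search\nengine.useProxy=true\n";
        EngineProperties p = load(new StringReader(config));
        System.out.println(p.targetUrl + " timeout=" + p.timeoutMillis + " proxy=" + p.useProxy);
    }
}
```

Supplying defaults through a second `Properties` object means the values hard-coded in stage 1 can remain as fallbacks once the config file is introduced.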
  • 7. Stage 3: High Level Functions
    Module: Functions & Comment
    Scraping Engine:
    - Read the Page Element Configuration from a config file or table
    - Rewrite the whole data scraping algorithm based on the Page Element Configuration
    Page Element Configurator
    Proxy Collector: automatically collect proxy IPs from internet websites and validate their connectivity
    Notification Template Management: define the message templates outside the system instead of writing the message content in Java code
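Moving notification messages out of Java code, as Notification Template Management proposes, could be sketched as below. The `${name}` placeholder syntax, the `NotificationTemplate` class and the sample field names are assumptions for illustration only; the document does not specify a template format. In this sketch, placeholders without a value are left visible so that a missing field is easy to spot.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical template holder: the text would come from a file or table
// maintained outside the system rather than from a string in Java code.
final class NotificationTemplate {
    // Matches placeholders such as ${caseName}; this syntax is an assumption.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([A-Za-z0-9_.]+)\\}");

    private final String text;

    NotificationTemplate(String text) {
        this.text = text;
    }

    // Replaces each known placeholder with its value and keeps unknown ones as-is.
    String render(Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            String replacement = value != null ? value : m.group(0);
            // quoteReplacement stops '$' and '\' in values from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        NotificationTemplate template =
                new NotificationTemplate("Scraping case ${caseName} finished with status ${status}.");
        System.out.println(template.render(Map.of("caseName", "Divvy", "status", "SUCCESS")));
    }
}
```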