Web Aggregation and Mashup with Kapow Mashup Server
CHAPTER 1: INTRODUCTION

1.0 Introduction

Nowadays the Internet has become a vast platform for storing information. With just a few clicks, we can browse a great deal of information. Information stored in the World Wide Web (WWW), or the Web for short, can be accessed from anywhere and at any time. The massively increasing amount of structured data on the Web (the Data Web), and the need for novel methods to exploit these data to their full potential, are the motivation of this thesis. Building on the remarkable success of Web 2.0 mashups, this thesis regards websites as a database, where each web data source is seen as a table and a mashup is seen as a query over these sources.

The fast development and growing complexity of websites have made the Web essential to Internet users. Besides providing information, websites have become a platform that offers users services such as online booking systems. This thesis explores the problem of aggregating information about online booking systems from several websites and delivering it through a single point of access, or portal. The aggregation tool used in this research is called Kapow Mashup Server.

1.1 Problem Statement

There are already quite a number of websites that support online booking services in Malaysia, such as AirAsia (www.airasia.com), Malaysia Airlines System (www.malaysiaairlines.com), Firefly (www.fireflyz.com.my) and Maswing (www.maswing.com.my). However, looking for the right information, such as price rates, booking dates and availability, is time consuming, since this has to be done through repetitive manual browsing of the relevant websites. An automated system that provides such information is therefore important. Comparison through manual browsing is also not practical: users who want to compare price rates will have trouble doing so by browsing several websites one by one.
Aggregating the data from these websites requires the right tools. Different websites have their own architectures, and the data is located in different frames. It is a challenge to extract these data and offer integrated access to them through one portal.

1.2 Motivational Example

Searching the Web, we can find many websites and portals that provide information about online booking. The most important information that users would like to have about online booking is the time, duration and price, along with a comparison of several booking services.

Consider the following scenario. When we search for information about flight schedules, prices, destinations and so on, we have to browse many websites. Would it not be easier to browse a single website from which we can reach all the airlines available for a particular destination? Users also wish to compare the information about the booking they are about to make, and opening many browser windows to compare manually is not an effective way to do so.

Another scenario is when users want to book a hotel room online. Today, most hotels have their own website providing online booking services. Users may want to know the room price and availability, as well as check-in and check-out dates. A single website that aggregates all of this information makes it easy for users to compare.

1.3 Research Question

The Internet gives us many benefits, especially in providing information: within just a few clicks, people can find almost any information they want among the millions of available websites. People need this information and data to help them make decisions and compare alternatives.
The question is: how can these data be manipulated?
Internet technology has also developed rapidly towards greater efficiency, and people can carry out many tasks through online transactions. For online travel ticketing in particular, people want to know the departure date, the departure time and the price of the ticket, and they want to compare this information before they make a booking. Since there are many websites they can use to make a booking, comparing information across several websites by opening them one by one is tedious work. This thesis therefore asks the question: "How can data be aggregated into a single online ticket website?"

1.4 Aim and Research Objectives

The main aim of this research is to develop a prototype portal as a proof of concept for the problem of aggregating information that is currently available in Malaysian web-based online booking systems. Towards this end, we have identified the following specific research objectives:

1. To identify tools and agents that are suitable for web mashup and aggregation.
2. To explore a way to aggregate and mash up information from online booking systems.
3. To combine data from several online booking systems and create a portal where the data can be manipulated.

1.5 Summary of Contributions

There are two main contributions of this research. First, a prototype has been developed as a proof of concept. The prototype is a portal containing data extracted from several online booking systems, and it displays this information according to the user's needs.

Secondly, guidelines and a manual are provided on how web aggregation and mashup can be done with the selected tools. The guidelines cover what is needed and which techniques are involved in building the prototype.
1.6 Thesis Organization

The rest of this thesis is organized as follows. Chapter 2 presents a literature review of papers and works on web aggregation and mashup; several works and case studies are discussed, along with the scope that each of them covers. Chapter 3 describes the methodology used to achieve the objectives of the thesis, summarized in Figure 1.1. Chapter 4 elaborates on the prototype in terms of its design and implementation. Chapter 5 concludes the thesis and discusses future enhancements.

Figure 1.1: Thesis organization (Literature review and finding information → Selection of websites and suitable tools → Prototype development → Publishing prototype and guidelines)
CHAPTER 2: LITERATURE REVIEW

2.1 Introduction

Work on web aggregation and mashup has grown rapidly. In web development, a mashup is a web page or application that uses or combines data or functionality from two or more external sources to create a new service. The term implies easy, fast integration, frequently using open APIs and data sources, to produce enriched results that were not necessarily the original reason for producing the raw source data. According to Larry Dignan, based on a presentation by Gartner analyst David Gootzit, the future of the portal is mashups, SOA and more aggregation.

2.2 Related Works

Momondo.com is a travel search engine that allows the consumer to compare prices on flights, hotels and car rental. The search engine aggregates results from more than 700 travel websites simultaneously to give, within seconds, an overview of the best offers found. Momondo does not sell tickets; instead it shows the consumer where to buy at the best prices and links to the supplier. Momondo is free of charge to use, as it receives commission from sponsored links and advertising. In 2007, NBC Today's Travel recommended that when it comes to finding the best offers on flights, the consumer should go to sites like Kayak, Mobissimo, SideStep and Momondo instead of buying tickets from third-party sites that actually sell travel and deal directly with the airlines. In addition to price comparisons, Momondo also offers city guides written by the site's users and by bloggers based in different cities.

Kayak.com is a travel search engine website based in the United States. Founded in 2004, it aggregates information from hundreds of other travel sites and helps users book flights, hotels, cruises, and rental cars.
Kayak combines results from online travel agencies, consolidators such as Orbitz, and other sources such as large hotel chains. Like Momondo.com, Kayak does not sell directly to the consumer; rather, it aggregates results from other sites and then redirects the visitor to one of those sites for the reservation. Kayak.com thus makes money from pay-per-click advertising when the consumer clicks through to one of the compared websites (for example, when the consumer is redirected to the Orbitz website).

2.3 Papers on Web Aggregation and Mashup

In , the authors discuss the design and implementation of a prototype web information system that uses web aggregation as its core engine. Annotea is one project related to this field: a Semantic Web based project inspired by users' collaboration problems on the web, which examined what users did naturally and selected familiar metaphors for supporting better collaboration . In , the authors define a semantic web portal as any web portal that is developed based on semantic web technologies. They are in the process of developing such a portal using available semantic technologies, selecting only standard technologies that promise a generic solution. As a result, they expect to be able to provide basic development guidelines in the form of a portal architecture and design patterns.

In , the authors examine the development of web aggregators: entities that collect information from a wide range of sources, with or without prior arrangement, and add value through post-aggregation services. New web-page extraction tools, context-sensitive mediators, and agent technologies have greatly reduced the barriers to constructing aggregators, and they predict that aggregators will soon emerge in industries where they were not formerly present.
2.4 Other Works on Web Aggregation and Mashups

2.4.1 Mapping mashups

In this age of information technology, humans are collecting a prodigious amount of data about things and activities, both of which tend to be annotated with locations. All of these diverse data sets that contain location data are just waiting to be presented graphically using maps. One of the big catalysts for the advent of mashups was Google's introduction of its Google Maps API. This opened the floodgates, allowing Web developers to mash all sorts of data onto maps. Not to be left out, APIs from Microsoft (Virtual Earth), Yahoo (Yahoo Maps), and AOL (MapQuest) shortly followed.

2.4.2 Video and photo mashups

The emergence of photo hosting and social networking sites like Flickr, with APIs that expose photo sharing, has led to a variety of interesting mashups. Because these content providers have metadata associated with the images they host (such as who took the picture, what it is a picture of, and where and when it was taken), mashup designers can mash photos with other information that can be associated with that metadata. For example, a mashup might analyze song or poetry lyrics and create a mosaic or collage of relevant photos, or display social networking graphs based upon common photo metadata (subject, timestamp, and so on). Yet another example might take a Web site as input (such as a news site like CNN) and render the text in photos by matching tagged photos to words from the news.

2.4.3 Search and shopping mashups

Search and shopping mashups existed long before the term mashup was coined. Before the days of Web APIs, comparative shopping tools such as BizRate, PriceGrabber, MySimon, and Google's Froogle used combinations of business-to-business (B2B) technologies or screen scraping to aggregate comparative price data.
To facilitate mashups and other interesting Web applications, consumer marketplaces such as eBay and Amazon have released APIs for programmatically accessing their content.
2.4.4 News mashups

News sources (such as the New York Times, the BBC, or Reuters) have used syndication technologies like RSS and Atom since 2002 to disseminate news feeds related to various topics. Syndication feed mashups can aggregate a user's feeds and present them over the Web, creating a personalized newspaper that caters to the reader's particular interests. An example is Diggdot.us, which combines feeds from the techie-oriented news sources Digg.com, Slashdot.org, and Del.icio.us.

2.5 Related Technologies

A mashup application is architecturally comprised of three different participants that are logically and physically disjoint: the API/content providers, the mashup site, and the client's Web browser.

The API/content providers. These are the providers of the content being mashed. To facilitate data retrieval, providers often expose their content through Web protocols such as REST, Web Services, and RSS/Atom. However, many interesting potential data sources do not conveniently expose APIs. Mashups that extract content from sites like Wikipedia, TV Guide, and virtually all government and public-domain Web sites do so by a technique known as screen scraping. In this context, screen scraping denotes the process by which a tool attempts to extract information from the content provider by parsing the provider's Web pages, which were originally intended for human consumption.

The mashup site. This is where the mashup is hosted. Interestingly enough, just because this is where the mashup logic resides, it is not necessarily where it is executed. On one hand, mashups can be implemented similarly to traditional Web applications, using server-side dynamic content generation technologies like Java servlets, CGI, PHP or ASP.
2.5.1 SOAP

SOAP is a fundamental technology of the Web Services paradigm. Originally an acronym for Simple Object Access Protocol, SOAP has been re-termed Services-Oriented Access Protocol (or just SOAP) because its focus has shifted from object-based systems towards the interoperability of message exchange. There are two key components of the SOAP specification. The first is the use of an XML message format for platform-agnostic encoding, and the second is the message structure, which consists of a header and a body. The header is used to exchange contextual information that is not specific to the application payload (the body), such as authentication information. The SOAP message body encapsulates the application-specific payload. SOAP APIs for Web services are described by WSDL documents, which themselves describe what operations a service exposes, the format for the messages that it accepts (using XML Schema), and how to address it. SOAP messages are typically conveyed over HTTP transport, although other transports (such as JMS or e-mail) are equally viable.

2.5.2 REST

REST is an acronym for Representational State Transfer, a technique of Web-based communication using just HTTP and XML. Its simplicity and lack of rigorous profiles set it apart from SOAP and lend to its attractiveness. Unlike the typical verb-based interfaces that you find in modern programming languages (which are composed of diverse methods such as getEmployee(), addEmployee(), listEmployees(), and more), REST fundamentally supports only a few operations (that is, POST, GET, PUT, DELETE) that are applicable to all pieces of information. The emphasis in REST is on the pieces of information themselves, called resources. For example, a resource record for an employee is identified by a URI, retrieved through a GET operation, updated by a PUT operation, and so on.
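The employee-resource example just described can be sketched as a tiny in-memory resource store, where each generic operation applies uniformly to any URI. This is an illustrative sketch only (the URIs and field names are invented for the example), not a real web service:

```python
# Minimal illustration of REST's uniform interface: a handful of generic
# operations (GET, PUT, DELETE) applied to resources identified by URIs.
resources = {}

def put(uri, representation):
    """Create or replace the resource at `uri`."""
    resources[uri] = representation
    return representation

def get(uri):
    """Retrieve the current representation of the resource, or None."""
    return resources.get(uri)

def delete(uri):
    """Remove the resource at `uri`, returning its last representation."""
    return resources.pop(uri, None)

# The same verbs work for any resource -- no getEmployee()/addEmployee() pairs.
put("/employees/42", {"name": "A. Example", "role": "Engineer"})
print(get("/employees/42"))   # {'name': 'A. Example', 'role': 'Engineer'}
delete("/employees/42")
print(get("/employees/42"))   # None
```

The point of the sketch is that adding a new resource type (say, /departments/7) requires no new verbs, only new URIs.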
In this way, REST is similar to the document-literal style of SOAP services.

2.5.3 Screen scraping

The lack of APIs from content providers often forces mashup developers to resort to screen scraping in order to retrieve the information they seek to mash. Scraping is the process of using software tools to parse and analyze content that was originally written for human consumption, in order to extract semantic data structures representing that information which can be used and manipulated programmatically.
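As a minimal sketch of what such a scraper does, the following parses a fragment of HTML meant for human readers and recovers structured records from it, using only Python's standard library. The HTML fragment and route names are invented for illustration; a real scraper would fetch live pages and be far more defensive:

```python
from html.parser import HTMLParser

# Invented example page: a flight-price table as a browser would render it.
PAGE = """
<table>
  <tr><td>KUL-PEN</td><td>RM 89</td></tr>
  <tr><td>KUL-BKI</td><td>RM 199</td></tr>
</table>
"""

class TableScraper(HTMLParser):
    """Collects the text of each <td> cell, grouped into rows."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []          # start a fresh record
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td and data.strip():
            self._row.append(data.strip())

scraper = TableScraper()
scraper.feed(PAGE)
print(scraper.rows)   # [['KUL-PEN', 'RM 89'], ['KUL-BKI', 'RM 199']]
```

The scraper is coupled to the page's presentation: if the provider switches from a table to a list, the code silently returns nothing, which is exactly the fragility the next paragraphs discuss.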
A handful of mashups use screen scraping technology for data acquisition, especially when pulling data from the public sector. For example, real-estate mapping mashups can combine for-sale or rental listings and maps from a cartography provider with scraped "comp" data obtained from the county records office. Another mashup project that scrapes data is XMLTV, a collection of tools that aggregates TV listings from all over the world.

Screen scraping is often considered an inelegant solution, and for good reasons. It has two primary inherent drawbacks. The first is that, unlike APIs with interfaces, scraping has no specific programmatic contract between content provider and content consumer. Scrapers must design their tools around a model of the source content and hope that the provider consistently adheres to this model of presentation. Web sites have a tendency to overhaul their look-and-feel periodically to remain fresh and stylish, which imposes severe maintenance headaches on the scrapers, because their tools are likely to fail.

The second issue is the lack of sophisticated, reusable screen-scraping toolkit software, colloquially known as scrAPIs. The dearth of such APIs and toolkits is largely due to the extremely application-specific needs of each individual scraping tool. This leads to large development overheads, as designers are forced to reverse-engineer content, develop data models, and parse and aggregate raw data from the provider's site.

2.5.4 Semantic Web and RDF

The inelegant aspects of screen scraping are directly traceable to the fact that content created for human consumption does not make good content for automated machine consumption. Enter the Semantic Web, which is the vision that the existing Web can be augmented to supplement the content designed for humans with equivalent machine-readable information.
In the context of the Semantic Web, the term information is different from data; data becomes information when it conveys meaning (that is, when it is understandable). The Semantic Web has the goal of creating a Web infrastructure that augments data with metadata to give it meaning, thus making it suitable for automation, integration, reasoning, and re-use.
The W3C family of specifications collectively known as the Resource Description Framework (RDF) serves this purpose of providing methodologies to establish syntactic structures that describe data. XML in itself is not sufficient; it is too arbitrary, in that you can code it in many ways to describe the same piece of data. RDF Schema adds to RDF's ability to encode concepts in a machine-readable way. Once data objects can be described in a data model, RDF provides for the construction of relationships between data objects through subject-predicate-object triples ("subject S has relationship R with object O"). The combination of data model and graph of relationships allows for the creation of ontologies, which are hierarchical structures of knowledge that can be searched and formally reasoned about. For example, you might define a model in which "carnivore-type" is a subclass of "animal-type" with the constraint that it "eats" other "animal-type", and create two instances of it: one populated with data concerning cheetahs and polar bears and their habitats, another concerning gazelles and penguins and their respective habitats. Inference engines might then "mash" these separate model instances and reason that cheetahs might prey on gazelles but not penguins.

RDF data is quickly finding adoption in a variety of domains, including social networking applications (such as FOAF -- Friend of a Friend) and syndication (such as RSS, which is described next). In addition, RDF software technology and components are beginning to reach a level of maturity, especially in the areas of RDF query languages (such as RDQL and SPARQL) and programmatic frameworks and inference engines (such as Jena and Redland).
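The triple model and the cheetah/gazelle inference above can be sketched with plain tuples. This toy example is illustrative only: the animal facts and predicate names are invented, and a real system would use an RDF framework such as Jena rather than hand-rolled sets:

```python
# Toy subject-predicate-object triples, in the spirit of RDF.
triples = {
    ("carnivore", "subclassOf", "animal"),
    ("herbivore", "subclassOf", "animal"),
    ("cheetah", "type", "carnivore"),
    ("gazelle", "type", "herbivore"),
    ("penguin", "type", "herbivore"),
    ("cheetah", "livesIn", "savanna"),
    ("gazelle", "livesIn", "savanna"),
    ("penguin", "livesIn", "antarctica"),
}

def objects(s, p):
    """All objects o such that the triple (s, p, o) is present."""
    return {o for (s2, p2, o) in triples if s2 == s and p2 == p}

def may_prey_on(predator, prey):
    """Infer predation: a carnivore may prey on animals sharing its habitat."""
    return (
        "carnivore" in objects(predator, "type")
        and bool(objects(predator, "livesIn") & objects(prey, "livesIn"))
    )

print(may_prey_on("cheetah", "gazelle"))   # True  (shared savanna habitat)
print(may_prey_on("cheetah", "penguin"))   # False (no shared habitat)
```

The two "model instances" in the text correspond here to the cheetah facts and the gazelle/penguin facts; the inference only becomes possible once both sets of triples are mashed into one graph.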
An RSS-enabled client can then check the publisher's feed for new content and react to it in an appropriate manner. RSS has been adopted to syndicate a wide variety of content, ranging from news articles and headlines, changelogs for CVS check-ins or wiki pages, and project updates, to audiovisual data such as radio programs. Version 1.0 is RDF-based, but the most recent version, 2.0, is not.
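The core of such a client can be sketched in a few lines with Python's standard XML library. The feed content below is invented for illustration; a real client would download the XML over HTTP and handle the differences between RSS versions:

```python
import xml.etree.ElementTree as ET

# Invented RSS 2.0 fragment, as a publisher might serve it.
FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Travel News</title>
    <item><title>New KUL-PEN route announced</title><link>http://example.com/1</link></item>
    <item><title>Fare promotion this weekend</title><link>http://example.com/2</link></item>
  </channel>
</rss>
"""

root = ET.fromstring(FEED)
# Each <item> in the channel is one syndicated entry.
headlines = [item.findtext("title") for item in root.iter("item")]
print(headlines)
# ['New KUL-PEN route announced', 'Fare promotion this weekend']
```

A feed aggregator of the kind described in section 2.4.4 is essentially this loop run over many feeds, with the items merged and sorted for presentation.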
Atom is a newer, but similar, syndication protocol. It is a proposed standard at the Internet Engineering Task Force (IETF) and seeks to maintain better metadata than RSS, to provide better and more rigorous documentation, and to incorporate the notion of constructs for common data representation. These syndication technologies are great for mashups that aggregate event-based or update-driven content, such as news and weblog aggregators.

2.6 Aggregation and Mashup Challenges

Mashing up and aggregating the web has its own challenges. These can be divided into three kinds: technical challenges, component challenges and social challenges.

2.6.1 Technical Challenges

Like any other data integration domain, mashup development is replete with technical challenges that need to be addressed, especially as mashup applications become richer in features and functionality. For example, translation systems between data models must be designed. When converting data into common forms, reasonable assumptions often have to be made when the mapping is not a complete one (for example, one data source might have a model in which an address type contains a country field, whereas another does not). Already challenging, this is exacerbated by the fact that the mashup developers might not be domain experts on the source data models, because the models are third-party to them, and these reasonable assumptions might not be intuitive or clear.

In addition to missing data or incomplete mappings, the mashup designer might discover that the data they wish to integrate is not suitable for machine automation; it needs cleansing. For example, law enforcement arrest records might be entered inconsistently, using common abbreviations for names (such as "mkt sqr" in one record and "Market Square" in another).
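This kind of cleansing can be sketched as a normalization pass over free-text fields. The abbreviation table below is invented for illustration; real cleansing pipelines are far more involved (fuzzy matching, reference dictionaries, human review):

```python
# Invented abbreviation table for normalizing free-text location fields.
ABBREVIATIONS = {
    "mkt": "market",
    "sqr": "square",
    "st": "street",
}

def normalize(field):
    """Lower-case a field and expand known abbreviations, word by word."""
    words = field.lower().replace(".", "").split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

# Two inconsistently entered records now compare equal after cleansing.
print(normalize("mkt sqr"))        # market square
print(normalize("Market Square"))  # market square
```

Only after such normalization can the two records be matched automatically, which is what makes the raw data "suitable for machine automation" in the sense used above.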
have to work together to assemble open standards and reusable toolkits in order to facilitate mature software development processes.

Before mashups can make the transition from cool toys to sophisticated applications, much work will have to go into distilling robust standards, protocols, models, and toolkits. For this to happen, major software development industry leaders, content providers, and entrepreneurs will have to find value in mashups, which means viable business models. API providers will need to determine whether or not to charge for their content, and if so, how (for example, by subscription or per use). Perhaps they will provide varying levels of quality of service. Some marketplace providers, such as eBay or Amazon, might find that the free use of their APIs increases product movement. Mashup developers might look for an ad-based revenue model, or perhaps build interesting mashup applications with the goal of being acquired.
CHAPTER 3: METHODOLOGY

3.1 Research Activities

To complete this research, a prototyping approach has been used in the system development process. The research activities carried out and the development of the prototype are sufficient to show that the conceptual idea works satisfactorily towards achieving the main idea of web mashup and aggregation.

The proposed research methodology focuses primarily on three main activities. Firstly, emphasis is put on identifying the information model to be used for aggregating web data. This includes the tools, platform and latest technology to be used in the prototype development. It is important that the tools involved are easy to use and are able not only to extract data but also to do more, such as updating the data as the client website changes. In this stage, the objects and classes are identified to capture the information that we need to harvest.

Secondly, after the information model is identified and the tools are confirmed, the focus shifts to the integration and collaboration of the websites. Models and web bots (Kapow robots) are developed and deployed to harvest the needed information. These bots bring back the data that we will use.

Lastly, an interface is developed to serve as the portal for the information that has been collected by the mashup robots.
3.2 Overview of Development Process

A prototype is developed in this research. Prototyping is the rapid development of a system. In the past, the developed system was normally thought of as inferior in some way to the required system, so further development was required. There are five steps in the prototype development methodology:

1. Gather requirements
2. Build prototype
3. Evaluate prototype
4. If accepted, throw away the prototype and redesign
5. If rejected, re-gather requirements and repeat from step 2

Figure 3.1 illustrates the prototype model.

Figure 3.1: Prototype model (Outline requirements → Evolutionary prototyping → Delivered system; Outline requirements → Throw-away prototyping → Executable prototype + system specification)

3.3 Gather Requirements

A few methods are used to gather requirements. In this case, we used the Internet, relevant papers and journals, and the university library to find information. Basically, information from previous journal papers and research on the topic is needed. For the prototype specifically, the software, hardware and technologies suitable for developing the prototype have to be identified.
Databases
  PointBase: Server Version 4.4 and 4.5
  MySQL: Version 4.0, 4.1 and 5.0

APIs
  Java: J2SE 1.3 + JAXP, or J2SE 1.4 or later
  .NET: C#, .NET Version 1.0 and 1.1

Clipping Portlets
  BEA WebLogic Portal: Version 8.1 (all service packs)
  IBM WebSphere Portal: Version 5.0 and 5.1
  Standard Java Portal: JSR-168

Clipping Browsers
  Microsoft Internet Explorer: Version 6.0
  Mozilla Firefox: Version 1.5+ (both Windows and Linux)

Tag Library
  JSP: Version 1.2 and 2.0

Web Services
  BEA WebLogic Workshop: Version 8.1 (all service packs)
  .NET: .NET Version 1.0 and 1.1

Code Generation
  Java: J2SE 1.3 or later
  .NET: C#, .NET Version 1.0 and 1.1

Table 3.1: Software requirements
Hardware requirements:

The table below specifies system requirements for different platforms. The requirements may depend on the application, so these should only be taken as guidelines and not as absolute numbers. A complex clipping solution might require much more power than a simple collection solution. The recommendations for servers are per server; the number of servers used for a given application (the size of a cluster) is a completely different matter and should be estimated using methods described elsewhere.

IDE (Windows or Linux)
  Minimum: Intel Pentium 1 GHz CPU, 512 MB RAM, 200 MB free disk space
  Recommended: Intel Pentium 2 GHz CPU, 1 GB RAM, 200 MB free disk space

Server (Windows or Linux)
  Minimum: Intel Pentium 2 GHz CPU, 1 GB RAM, 200 MB free disk space
  Recommended: Intel Pentium 2 GHz CPU, 2 GB RAM, 200 MB free disk space

Source: http://kdc.kapowtech.com/documentation_6_4/Technical/TechnicalDataSheet6_4.pdf

Table 3.2: Hardware requirements

Besides information about software and hardware requirements, the websites to be targeted for harvesting information are also identified. Websites that offer an online booking/ticketing system are given priority.
3.4 Build Prototype

In building the prototype, every aspect, from installation to reading the manuals, must be prepared well. The main tool used to develop the prototype is Kapow Mashup Server; the other tool is IntelliJ IDEA.

3.4.1 Kapow Mashup Server

When it comes to web data access, extraction and harvesting, Kapow Mashup Server is a tool suited to all of these tasks. Kapow is also known as a web integration platform, and the Kapow Mashup Server makes it possible to access data or content from any browsable application or website. Over the past few years, the Kapow Mashup Server has become a lightweight services and mashup standard among Internet-intensive businesses in the areas of media, financial services, travel, manufacturing and information services (background checking, information providers, etc.).

Kapow Mashup Server is a platform for web integration. It helps transform the resources of the web into well-defined nuggets of information and functionality; in effect, it transforms a web site into services available to client applications. According to , the Kapow Web Data Server powers solutions in web and business intelligence, portal generation, SOA/WOA enablement, and content migration. Kapow's patented visual programming and integrated development environment (IDE) technology enables business and technical decision-makers to create innovative business applications. With Kapow, new applications can be completed and deployed in a fraction of the time and cost associated with traditional software development methods.

This research uses Kapow as the main application tool because Kapow gives the best results when it comes to data aggregation. By creating models and robots, functions such as the collection of internal or external web-based data sources, website clipping, and so on can be done easily without doing any programming.
A few abilities of the Kapow Mashup Server that contribute to the development of the prototype are:

- Web integration
- Code generation
- Data harvesting

The Kapow Mashup Server also provides web-to-web data integration functionality, allowing data to be extracted from one website, transformed into a new format, and pushed through input forms into a second website. This can be a many-to-many process: data is extracted from multiple websites, combined, transformed and pushed into multiple other websites. Web-based transformation is also supported, e.g. using a website for real-time language translation or HTML-to-XML conversion.

3.4.2 IntelliJ IDEA

A platform is needed to write all the web-based programming languages used, such as Java, HTML and PHP. IntelliJ IDEA is a code-centric IDE focused on developer productivity. IntelliJ IDEA deeply understands the code and provides a set of powerful tools without imposing any particular workflow or project structure. Imagine that we have a large source code base that we need to browse or modify. For instance, we might want to use a library and find out how it works, or we might need to get acquainted with existing code in order to modify it. Yet another example is when a new JDK becomes available and we are keen to see the changes in the standard Java libraries. Conventional find-and-replace tools may not completely address these goals, because with them it is easy to find or replace too much or too little. Of course, if someone already knows the source code well, then using the whole-words option and regular expressions may help make find-and-replace queries smarter. This is one advantage of using this tool for the development of the prototype.
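The find-and-replace pitfall mentioned above is easy to demonstrate. In this illustrative sketch (the code line and identifier names are invented), a naive substring replacement of an identifier matches too much, while a regular expression with word boundaries, like an IDE's whole-words option, touches only the intended name:

```python
import re

code = "rate = base_rate * 2  # rate per ticket"

# Naive substring replace also corrupts 'base_rate'.
naive = code.replace("rate", "price")
print(naive)    # price = base_price * 2  # price per ticket

# Word-boundary regex leaves 'base_rate' untouched.
precise = re.sub(r"\brate\b", "price", code)
print(precise)  # price = base_rate * 2  # price per ticket
```

An IDE's rename refactoring goes a step further still, using the parsed structure of the code rather than text patterns, which is the "deep understanding" the paragraph refers to.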
3.5 Evaluate Prototype

To be more specific, the prototyping approach that we use is called extreme prototyping. Basically, it breaks web development down into three phases, each one based on the preceding one. The first phase is a static prototype that consists mainly of HTML pages. In the second phase, the screens are programmed and made fully functional using a simulated services layer. In the third phase, the services are implemented. The process is called extreme prototyping to draw attention to the second phase, in which a fully functional UI is developed with very little regard for the services other than their contract.

In this stage, all the robots and models that were created with the Kapow Mashup Server are ready to be deployed. Using Kapow's code generation ability, the generated code is copied into the programming development tool, IntelliJ IDEA.
CHAPTER 4: PROTOTYPE DESIGN AND IMPLEMENTATION
This chapter explains the actual concept of the research and how various measures are taken to prove the concept, covering the whole implementation process of the aggregation and mashup tool, named Kapow as Web Aggregation and Mashup for Online Booking System.
4.1 Conceptual Design
4.1.1 Kapow Mashup Server as the Tool
The Kapow Mashup Server enables the collection, connection and mashup of content on corporate intranets as well as the World Wide Web. These abilities make the Kapow Mashup Server the first choice for the development of the prototype. As described earlier, it provides web-to-web data integration: data can be extracted from one or more websites, combined, transformed, and pushed through input forms into one or more other websites, and web-based transformations such as real-time language translation or HTML-to-XML conversion are also supported.
Figure 4.1: Different layers involved in Kapow Mashup Server (Kapow website)
Figure 4.1 gives a general picture of how Kapow works. Three layers are involved: the integrated development environment, web-based management, and the scalable server environment. The next sections explain each of these layers in more detail.
Four important elements are involved in the integrated development environment layer, which can be regarded as the primary studio tools of the Kapow Mashup Server.
Figure 4.2: Kapow ModelMaker Interface
ModelMaker is the RoboSuite application for writing and maintaining the domain models that are used in RoboMaker. With ModelMaker, we can easily create new domain models and configure existing ones, as well as add, delete, and configure the objects within a domain model. RoboMaker, meanwhile, is the RoboSuite application for creating and debugging robots. RoboMaker is an integrated development environment (IDE) for robots: it is all we need to program robots in an easy-to-understand visual programming language. To support the construction of robots, RoboMaker provides powerful programming features, including interactive visual programming, full debugging capabilities, an overview of the program state, and easy access to context-sensitive online help.
Figure 4.3: Kapow Mashup Server RoboMaker Interface.
4.1.2 Architecture of the Prototype
The prototype is basically a 3-tier architecture. Existing online booking/ticketing systems on the Internet are used as the data sources. The data are extracted by the Kapow Mashup Server tools, specifically by a Kapow robot built according to the model. The robot is deployed to harvest the desired data and bring them back to the portal, which is developed using several web programming languages with Apache as the web server. Figure 4.4 shows the architecture of the prototype.
4.2 Prototype Implementation Design
4.2.1 Websites
In this thesis, for testing purposes, the aggregator is tested against a single website, www.agoda.com, which provides hotel information. The information customers usually need is the name of the hotel, the rate per night, the location and the dates. Kapow is used as the tool for extracting all of this information.
Each website has its own structure and design, and websites often change their structure, layout and design to suit current needs. Such changes can be a problem for the prototype to adapt to, because the data or information to be harvested may change its location or position on the website. This may cause the robot to bring back the wrong information.
4.2.2 Model
ModelMaker is used to create and edit object models. An object model is like a type definition in a programming language: it defines the structure of the objects that form the input and output of a robot. ModelMaker is a visual tool for creating the data objects that define the data structures used by robots for information collection, aggregation and integration.
An object model consists of one or more attribute definitions, each of which defines an attribute name, type, and other information. A given robot will return (or store) objects defined by one or more object models. For example, a data collection robot for job postings could return objects defined by the object model Job. Job would contain attributes such as title and source (short text types), date (date type), description (long text) and so on. If the objects are stored in a database at runtime, the database will have a table definition matching the object model; ModelMaker can generate the SQL necessary to create the required tables in the database.
First, a model for hotels needs to be created. Since this is an input-and-output type of query, two objects are needed in the model.
HotelQuery takes the input attributes country, city, checkindate and checkoutdate, while HotelResult holds the extracted output.
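For illustration, the HotelQuery object model can be thought of as a plain Java class. The attribute names below follow the model described above; the class itself is a sketch, since Kapow defines this structure visually in ModelMaker rather than in code.

```java
// Illustrative sketch only: the HotelQuery object model as a plain Java class.
class HotelQuery {
    String country;
    String city;
    String checkindate;
    String checkoutdate;

    HotelQuery(String country, String city, String checkindate, String checkoutdate) {
        this.country = country;
        this.city = city;
        this.checkindate = checkindate;
        this.checkoutdate = checkoutdate;
    }

    public static void main(String[] args) {
        // A query for a Kuala Lumpur hotel stay, using dates from the test scenario.
        HotelQuery q = new HotelQuery("Malaysia", "Kuala Lumpur", "2009-08-01", "2009-08-05");
        System.out.println(q.city + ": " + q.checkindate + " to " + q.checkoutdate);
    }
}
```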
The figures below show what the model looks like. Figure 4.5 shows the object called HotelQuery with the attributes country, city, checkin, and checkout.
Figure 4.5: HotelQuery attributes
Figure 4.6 shows the output object, HotelResult, with its attributes.
Figure 4.6: HotelResult as output object.
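As noted above, ModelMaker can generate the SQL needed to create database tables matching an object model. The mapping is mechanical, as the following sketch suggests; the SQL column types chosen for the short text and date attribute types are assumptions for illustration, not output copied from ModelMaker.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of model-to-SQL generation: given an object model's attribute
// names and (assumed) SQL column types, emit a CREATE TABLE statement
// of the kind ModelMaker produces for its runtime database tables.
class ModelToSql {
    static String createTable(String model, Map<String, String> attrs) {
        StringBuilder sb = new StringBuilder("CREATE TABLE " + model + " (");
        boolean first = true;
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            if (!first) sb.append(", ");
            sb.append(e.getKey()).append(" ").append(e.getValue());
            first = false;
        }
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        // The HotelQuery attributes, with assumed column types.
        Map<String, String> hotel = new LinkedHashMap<>();
        hotel.put("country", "VARCHAR(255)");  // short text
        hotel.put("city", "VARCHAR(255)");     // short text
        hotel.put("checkin", "DATE");          // date
        hotel.put("checkout", "DATE");         // date
        System.out.println(createTable("HotelQuery", hotel));
    }
}
```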
4.4: Creating Robots
4.4.1: Creating a Robot for the Hotel Website
A robot is created and deployed to harvest all information according to the model. The steps to create the robot are as follows.
Step 1: Choose the Integration robot type from the New Robot Wizard.
Figure 4.7: Create new Integration Robot
Step 2: Enter the URL that the robot should start from: www.agoda.com
Figure 4.8: Enter www.agoda.com
Step 3: Select the objects that the robot should receive as input. Choose HotelQuery, which was created in the model wizard.
Figure 4.9: Choose HotelQuery
Step 4: From the wizard, select the objects that the robot should return as output.
Figure 4.10: Choose HotelResult
Step 5: Select the objects that the robot should use to hold temporary data during its execution.
Figure 4.11: ScratchPad holding the temporary data
Figure 4.12: Two output objects, HotelResult and ScratchPad
Step 6: Enter the next attribute.
Figure 4.13: Loading the website into the Kapow interface
Step 10:
Figure 4.17: Information that is collected
4.4.2: Creating a Robot for the Flight Website
Airasia.com is chosen as the website from which flight information is extracted. First, the model is created as follows.
Create a model for the robot. First create Flight_In.model, which will collect all data about where you want to fly from. This will be your input data. Add the attributes below:
1. Origin [data type: short text]
2. Destination [data type: short text]
3. Dep_date [data type: date]
4. Ret_date [data type: date]
Figure 4.18: How the Flight_In attributes look.
Create Flight_Out.model, which will collect all data about the destination of your flight. This will be your output data. Add the attributes for Flight_Out.model as listed below:
1. Origin [data type: short text]
2. Destination [data type: short text]
3. Dep_date [data type: date]
4. Arr_date [data type: date]
5. Flight_no [data type: short text]
6. Price [data type: number]
7. Currency [data type: short text]
8. Carrier [data type: short text]
Figure 4.19: How the Flight_Out attributes look.
Save the model as flight.model.
Creating airasia.robot
First, open the RoboMaker application.
Figure 4.20: Creating airasia.robot
Choose Create a new robot… and click OK.
Figure 4.21: Choose Integration robot, then click NEXT.
Figure 4.22: Enter the URL that the robot should start from: http://www.airasia.com/site/my/en/home.jsp
Figure 4.23: Select the objects to input to the robot. Here, Flight_In is added as the input object.
Figure 4.24: Select the objects to output from the robot. Select Flight_Out as the output object and click FINISH.
Figure 4.25: How the first screen, Load Page, looks.
Move your cursor to the origin field and right-click. As shown in Figure 4.26, click on "Select Option".
Figure 4.26: Select Option for Origin.
A pop-up screen will appear; choose your origin from the drop-down menu and set it as the value. In this tutorial, choose Kuala Lumpur LCCT.
Figure 4.27: Option to Select. Set Kuala Lumpur LCCT as the origin value.
Do the same for the destination, which in this case is Bintulu.
Figure 4.28: Set the destination to Bintulu.
Next, we need to set the departure date of the flight. The date will be extracted to Flight_In.dep_date. Put your cursor on the Departure Date field, right-click on it, choose Select Option, and choose the date of departure. See Figure 4.29 and Figure 4.30.
Figure 4.29: Select Option for date of departure.
Figure 4.30: Select day of departure.
The day must be inserted on its own. To do that, we need to convert the full date format so that only the day is extracted from it. See Figure 4.31.
Figure 4.31: Select Converters
Figure 4.32: Get Attribute. Click Configure to configure Get Attribute as shown in Figure 4.33.
Figure 4.33: Set the attribute to Flight_In.dep_date.
However, no value will appear there until you set the values of all Flight_In attributes yourself. The values are as below (see Figure 4.34):
Flight_In.Origin = Kuala Lumpur LCCT
Flight_In.Destination = Bintulu
Flight_In.dep_date = 2009-08-01 00:00:00.0
Flight_In.ret_date = 2009-08-05 00:00:00.0
Figure 4.34: Setting the attributes. Fill the attributes with the information shown, then click Apply.
After that, click on "Configure" to set the day for Flight_In.dep_date as shown in Figure 4.35.
Figure 4.35: Set the attribute to Flight_In.dep_date.
Click to add a converter, select Date Handling, and choose Format Date as shown in Figure 4.36.
Figure 4.36: Formatting the date.
Configure Format Date as shown in Figure 4.37.
Figure 4.37: Format pattern. Change the Format Pattern to "dd".
Do the same steps for the month and year of the departure date; in that step, however, choose "Aug 2009" ("200908") as the option to select and set it as the value. See Figure 4.38.
Figure 4.38: Date format
Repeat the steps in Figures 4.30 through 4.38 to set the day, month and year for Flight_In.ret_date. In this case, use 05 August 2009 as the return date.
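Kapow's Format Date converter uses date patterns of the same style as Java's SimpleDateFormat, so its effect can be sketched in plain Java. The helper below is an illustration only, not Kapow code: it parses the stored attribute value (the format string for the stored value is an assumption based on the values shown in Figure 4.34) and reformats it with the pattern "dd" for the day, or "yyyyMM" for the "200908" month option.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

// Sketch of what the Format Date converter does: parse the stored
// attribute value, then emit only the requested date fields.
class FormatDateDemo {
    static String reformat(String value, String inPattern, String outPattern) {
        try {
            Date d = new SimpleDateFormat(inPattern).parse(value);
            return new SimpleDateFormat(outPattern).format(d);
        } catch (ParseException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String stored = "2009-08-01 00:00:00.0"; // Flight_In.dep_date as shown in Figure 4.34
        System.out.println(reformat(stored, "yyyy-MM-dd HH:mm:ss.S", "dd"));     // prints 01
        System.out.println(reformat(stored, "yyyy-MM-dd HH:mm:ss.S", "yyyyMM")); // prints 200908
    }
}
```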
Next, put your cursor on the "Search" button as shown in Figure 4.39. Right-click on it and choose "Click".
Figure 4.39: Choose Click to search for flights.
After clicking the Search button, a screen with the search results will appear. Click on the table of flight information and expand it to create loops.
Figure 4.40: Creating loops. Step 1: Click on the table. Step 2: Expand the green-line square to cover the loop area. Step 3: Right-click inside the green square, choose Loops and select For Each Tag.
Figure 4.41: First tag finder. Step 1: Click back to the loop. Step 2: Replace 0 with 1.
The next step is extracting information for the Flight_Out object.
Figure 4.42: Extracting to Flight_Out.origin. Step 1: Click on "Kuala Lumpur" and expand it. Right-click on the words, select Extraction, select Text and choose Flight_Out.origin.
Configure the extraction using Advanced Extract, as seen in Figure 4.43.
Figure 4.43: Configure extraction by using Advanced Extract.
Figure 4.44: Pattern and Output Expression. Click Configure to configure the extraction; use the pattern .*to(.*) with the output expression $1.
The next step is to extract the departure date; see the steps in the figure.
Figure 4.45: Steps to extract the date of departure to Flight_Out.dep_date. Step 1: Right-click on the hour. Step 2: Choose Extraction => Extract Date => Flight_Out.dep_date.
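An Advanced Extract pattern is an ordinary regular expression, and its output expression ($1 refers to the first capture group) can be mimicked in plain Java. The sketch below is an illustration, not Kapow code, and the sample input string is an assumption; the pattern .*to(.*) is the one configured in Figure 4.44.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of an Advanced Extract step: match a regular expression against
// the tag's text and return the first capture group (Kapow's "$1").
class AdvancedExtractDemo {
    static String extract(String text, String regex) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.matches() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        // Assumed sample text; the pattern captures everything after "to".
        System.out.println(extract("Kuala Lumpur to Bintulu", ".*to(.*)")); // prints Bintulu
    }
}
```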
The next step is to configure the date format. See Figure 4.46.
Figure 4.46: Date format. Set the Format Pattern to hhmm.
Do the same steps to extract the arrival date and save it to Flight_Out.arr_date. See the steps in Figure 4.47.
Figure 4.47: Extracting the arrival date. Step 1: Right-click on the arrival hour. Step 2: Choose Extraction => Extract Date => Flight_Out.arr_date.
Figure 4.48: Set the Format Pattern of the date to hhmm as well.
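Once departure and arrival times are stored in the hhmm pattern, downstream code can work with them directly, for example to compute flight duration for the comparison portal. This helper is an assumption of ours, not part of Kapow, and it assumes a same-day arrival.

```java
// Assumed helper (not generated by Kapow): compute flight duration in
// minutes from two hhmm-formatted time strings.
class FlightDuration {
    static int minutesBetween(String depHhmm, String arrHhmm) {
        int dep = Integer.parseInt(depHhmm.substring(0, 2)) * 60
                + Integer.parseInt(depHhmm.substring(2));
        int arr = Integer.parseInt(arrHhmm.substring(0, 2)) * 60
                + Integer.parseInt(arrHhmm.substring(2));
        return arr - dep; // assumes arrival on the same day as departure
    }

    public static void main(String[] args) {
        // A 0905 departure and 1100 arrival give a 115-minute flight.
        System.out.println(minutesBetween("0905", "1100") + " minutes"); // prints 115 minutes
    }
}
```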
As we can see in the browser, the Departure table contains all of this information (Figure 4.49). We need to extract Depart (0905) to Flight_Out.dep_date and Arrive (1100) to Flight_Out.arr_date, which we have already done in the previous steps. Now we need to extract Flight (AK 5146) to Flight_Out.flight_no, Fare (156.00) to Flight_Out.price and Currency (MYR) to Flight_Out.currency.
Figure 4.49: Depart table
Extracting the flight number: see Figure 4.50.
Figure 4.50: Extracting the flight number.
Extracting the price: see Figure 4.51.
Figure 4.51: Extracting the price to Flight_Out.price.
Extract the currency to Flight_Out.currency. See Figure 4.52.
Figure 4.52: Extracting the currency.
For the currency, use Advanced Extract. See Figure 4.53.
Figure 4.53: Format for Advanced Extract. Step 1: Click to add an Advanced Extract. Step 2: Click Configure. Step 3: Set the pattern to .* (.*) and the Output Expression to $1.
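If the fare cell reads, say, "156.00 MYR" (an assumed layout, for illustration), the configured pattern .* (.*) with output $1 keeps what follows the last space, i.e. the currency. A two-group variant of the same idea, sketched below and not taken from Kapow, splits such a cell into price and currency in one pass.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: split an assumed "price currency" fare cell with
// a two-group regular expression (group 1 = price, group 2 = currency).
class FareSplit {
    static String[] split(String cell) {
        Matcher m = Pattern.compile("(.*) (.*)").matcher(cell);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] parts = split("156.00 MYR"); // assumed sample cell
        System.out.println("price=" + parts[0] + " currency=" + parts[1]);
    }
}
```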
At the end of the robot, we must return the object. See Figure 4.54.
Figure 4.54: Returning the object. Choose Return Object for Flight_Out.
For the Return table, we need to extract the same kind of information as we did for the Depart table; apply all the steps used for the Depart table.
Figure 4.55: Return table.
Branches and loops are created for the Return table. See Figure 4.56 and Figure 4.57.
Figure 4.56: Creating a branch. Step 1: Click the step. Step 2: Click to create the branch; the new branch will appear.
Figure 4.57: Creating loops for the Return table. Step 1: Click any area inside the table. Step 2: Expand the green square to cover the table for the loop. Step 3: Right-click on the green box area. Step 4: Choose For Each Tag loops.
After these steps, follow the steps that were applied for the Depart table.
Last but not least, the robot must be debugged. Click on the debug icon to debug the robot. See Figure 4.58.
Figure 4.58: Debugging. Click the debug icon.
The debugging screen will appear after you click the debug icon. See Figure 4.59 for how to run the debugger.
Figure 4.59: Debug screen. Click to run the debugger; the information collected by your robot is shown.
4.4.3 IntelliJ Settings
For the prototype, some settings are needed. The general settings are as shown in the picture.
Figure 4.60: Path of the project compiler
As shown in the figure, the project compiler output path needs to be set. This path is used to store all project compilation results. A directory corresponding to each module is created under this path, containing two subdirectories, Production and Test, for production code and test sources respectively. A module-specific compiler output path can be configured for each module as required. In this case, the directory is called "workspace" with the subdirectories "hotel" and "out"; the path is C:\workspace\hotel\out.
Figure 4.61: Setting the classes
The classes for the project also need to be set. Attach the classes at C:\Program Files\Kapow Mashup Server 6.4\API\robosuite-java-api\lib\robosuite-api.jar. This links IntelliJ IDEA to the Kapow RoboSuite API.
CHAPTER 5: FUTURE ENHANCEMENTS AND CONCLUSION
5.1: Future Enhancements
This thesis has introduced a way to aggregate information from several online booking service websites, such as hotel, airline and ticket booking sites. The emphasis is on the technique for aggregation with Kapow Mashup. The prototype that was developed uses only one website, the hotel website; in the future, more websites could be added. More research could also be done on how to compare and use the data collected by the robot, and on applying it for data mining purposes.
The area of Web services aggregation is seeing a large amount of activity, as aggregation mechanisms are still evolving: some are being extended, and new ones are being created to enhance their capabilities. As multiple proposals emerge for aggregating Web services, it is important to understand where the required mechanisms fit in and how they relate to existing approaches. Ongoing work will reflect the effects of the evolution of core specifications, including WSDL, as well as the growth and adoption of Web services aggregation techniques. Refining and expanding the classification will involve adding categories as well as additional dimensions for existing categories, such as the level and focus of constraints. We are also interested in identifying primitive aggregation mechanisms and understanding the conditions under which they may or may not be combined.
The World Wide Web contains an immense amount of information and is therefore nowadays often thought of as a huge database. However, as with relational databases, a database management system (DBMS) is needed to combine data from different sources and give the information new meaning.
In the sections above, API-driven mashup building was introduced as a way of mixing data from different Web sources, just like combining data from different tables in a relational database; this provides a way of managing the information stored in the database we call the World Wide Web. Building mashups using APIs requires strong programming skills, however, which makes them of little use to a regular person who wants to mix data sources from all over the Web. Another point is that most information on the Web is not accessible over an API, so only a small part of the WWW is remixable. The vision of making it easier to gather data for mashups in the future has been stated in the literature.
5.2 Conclusion
Every day, information keeps being added to websites throughout the world as long as there is access to the World Wide Web. People can rely on the Internet whenever they need information: with just one click, they can obtain the information they want. The massive amount of information and data on the Internet needs to be exploited and turned into useful information. If we assume that a website is a database consisting of tables, then using website aggregation tools we can query data from the website. This thesis has described how the mashup technique can be used to solve specific service issues for end users. In relation to this issue, a mashup technique is proposed using a tool called the Kapow Mashup Server. The relevant technologies that can be used for mashups at different service layers are also described. This type of architecture can leverage and integrate end-user-relevant information from existing web applications on the Web.
REFERENCES
1. Mustafa Jarrar, Marios D. Dikaiakos: A Data Mashup Language for the Data Web.
2. Bizer C., Heath T., Berners-Lee T.: Linked Data: Principles and State of the Art. WWW (2008).
3. Ainie Zeinaida, Nor Adnan Yahaya: Design and Implementation of an Aggregation-based Tourism Web Information System.
4. Marja-Riitta Koivunen: Annotea and Semantic Web Supported Collaboration.
5. Lidia Rovan: Realizing a Semantic Web Portal Using Available Semantic Web Technologies and Tools.
6. Stuart Madnick, Michael Siegel: Seizing the Opportunity: Exploiting Web Aggregation.
7. http://queue.acm.org/detail.cfm?id=1017013
8. http://www.langpop.com/. Retrieved 2009-01-16.
9. http://www.thirdnature.net/about_us.html
10. F. Curbera, M. Duftler, R. Khalaf, N. Mukhi, W. Nagy, and S. Weerawarana. BPWS4J. Published online by IBM at http://www.alphaworks.ibm.com/tech/bpws4j, Aug 2002.
11. Francisco Curbera, Matthew Duftler, Rania Khalaf, William Nagy, Nirmal Mukhi, and Sanjiva Weerawarana. Unraveling the Web services web: An introduction to SOAP, WSDL, and UDDI. IEEE Internet Computing, 6(2):86-93, 2002.
12. Francisco Curbera, Rania Khalaf, Frank Leymann, and Sanjiva Weerawarana. Exception handling in the BPEL4WS language. In International Conference on Business Process Management (BPM 2003), LNCS, Eindhoven, the Netherlands, June 2003. Springer.
13. Francisco Curbera, Rania Khalaf, Nirmal Mukhi, Stefan Tai, and Sanjiva Weerawarana. Web services, the next step: Robust service composition. Communications of the ACM: Service Oriented Computing, 2003.
14. Francisco Curbera, Sanjiva Weerawarana, and Matthew J. Duftler. On component composition languages. In Proc. International Workshop on Component-Oriented Programming, May 2000.
15. Eric M. Dashofy, Nenad Medvidovic, and Richard N. Taylor. Using off-the-shelf middleware to implement connectors in distributed software architectures. In Proc. of the International Conference on Software Engineering, pages 3-12, Los Angeles, California, USA, May 1999.
16. Iskold, A. Yahoo! Pipes and the Web as Database. Available at http://www.readwriteweb.com/archives/yahoopipesweb-database.php. (Accessed on 01/01/2010)
17. Shah J. Miah and John Gammack. A Mashup Architecture for Web End-user Application Designs. Institute for Integrated and Intelligent Systems, Griffith University, Nathan Campus, QLD 4111, Australia.
18. Christian Bizer, Richard Cyganiak, and Tobias Gauß. The RDF Book Mashup: From Web APIs to a Web of Data. Freie Universität Berlin.
19. http://kapowtech.com/index.php/about-us/overview
GLOSSARY
World Wide Web (WWW): The World Wide Web, abbreviated as WWW and commonly known as the Web, is a system of interlinked hypertext documents accessed via the Internet. With a web browser, one can view web pages that may contain text, images, videos, and other multimedia, and navigate between them by using hyperlinks.
Data Web: Data Web refers to the transformation of the Web from a distributed file system into a distributed database system.
Web 1.0: Web 1.0 (1991-2003) is a retronym that refers to the state of the World Wide Web, and any website design style used, before the advent of the Web 2.0 phenomenon. Web 1.0 began with the release of the WWW to the public in 1991, and is the general term that has been created to describe the Web before the "bursting of the dot-com bubble" in 2001. Since 2004, Web 2.0 has been the term used to describe the current web design, business models and branding methods of sites on the World Wide Web.
Web 2.0: The term Web 2.0 is commonly associated with web applications that facilitate interactive information sharing, interoperability, user-centered design, and collaboration on the World Wide Web. A Web 2.0 site gives its users the free choice to interact or collaborate with each other in a social media dialogue as creators (prosumers) of user-generated content in a virtual community, in contrast to websites where users (consumers) are limited to the passive viewing of content that was created for them. Examples of Web 2.0 include social-networking sites, blogs, wikis, video-sharing sites, hosted services, web applications, mashups and folksonomies.
APIs: An application programming interface (API) is an interface
implemented by a software program that enables it to interact with other software.
SOA: Search oriented architecture, the use of search engine technology as the main integration component in an information system.
Annotea: In metadata, Annotea is an RDF standard sponsored by the W3C to enhance document-based collaboration via shared document metadata based on tags, bookmarks, and other annotations.
Semantic Web: The Semantic Web is a group of methods and technologies to allow machines to understand the meaning, or "semantics", of information on the World Wide Web.
RSS: RSS (most commonly expanded as Really Simple Syndication) is a family of web feed formats used to publish frequently updated works, such as blog entries, news headlines, audio, and video, in a standardized format.
ATOM: The name Atom applies to a pair of related standards. The Atom Syndication Format is an XML language used for web feeds, while the Atom Publishing Protocol (AtomPub or APP) is a simple HTTP-based protocol for creating and updating web resources.
REST: Representational State Transfer (REST) is a style of software architecture for distributed hypermedia systems such as the World Wide Web.
Java servlets: A servlet is a Java class in Java EE that conforms to the Java Servlet API, a protocol by which a Java class may respond to HTTP requests.
CGI: Common Gateway Interface, a protocol for calling external software via a web server to deliver dynamic content (and .cgi, its associated file extension).
PHP: PHP: Hypertext Preprocessor is a widely used, general-purpose scripting language that was originally designed for web development to produce dynamic web pages.
ASP: Active Server Pages, a web-scripting interface by Microsoft.
IETF: The IETF develops and promotes Internet standards, cooperating closely with the W3C and ISO/IEC standards bodies and dealing in particular with standards of the TCP/IP and Internet protocol suite.
ActiveX: ActiveX is a framework for defining reusable software components in a programming-language-independent way. Software applications can then be composed from one or more of these components in order to provide their functionality.
WOA: Web Oriented Architecture, a computer systems architectural style.
IDE: An integrated development environment (IDE), also known as an integrated design environment or integrated debugging environment, is a software application that provides comprehensive facilities to computer programmers for software development.
DBMS: A Database Management System (DBMS) is a set of computer programs that controls the creation, maintenance, and use of a database.