
Visual Analytics as a Service

A drag-and-drop tool for creating scripts for Apache Spark. Copyright Ikwhan Chang, Ashwini Chellagurki, Keyur Golani, Venkata Narasa Kumar Kuchi


  1. 1. Visual Analytics as a Service A Project Report Presented to The Faculty of the College of Engineering San Jose State University In Partial Fulfillment Of the Requirements for the Degree Master of Science in Computer Engineering Master of Science in Software Engineering By Ikwhan Chang Ashwini Chellagurki Keyur Golani Venkata Narasa Kumar Kuchi Dec 2017
  2. 2. Copyright © 2017 Chang, Ikwhan; Chellagurki, Ashwini; Golani, Keyur; Kuchi, Venkata Narasa Kumar. ALL RIGHTS RESERVED
  3. 3. APPROVED Charles Zhang, Project Advisor; Dr. David Bruck, Director, MS Computer Engineering; Dan Harkey, Director, MS Software Engineering; Dr. Xiao Su, Department Chair
  4. 4. ABSTRACT Visual Analytics as a Service By Ikwhan Chang; Ashwini Chellagurki; Keyur Golani; Venkata Narasa Kumar Kuchi The engineering industry, and the computer engineering industry in particular, has always valued data. Data was once collected through surveys, forms, bulk experiments, and other cumbersome methods. As the computer industry evolved, the sources of data grew rapidly, and using data to solve business intelligence problems became a primary strategy for almost all companies, organizations, and individuals. Today, nearly all companies use machine learning algorithms, data wrangling, and the insights derived from both to solve problems that humans cannot solve in a realistic time frame or with realistic effort. As data grows, analytics grows with it. With current systems, the results of analytics and machine learning are not easy to visualize without coding. Different types of data, and different inferences drawn from them, make it challenging to visualize and make sense of the plethora of information we now possess. One needs a solid understanding of which machine learning algorithms to apply and the advantages of one over another. Code must be written to run the algorithms, see their results, compare them, fine-tune them, and generally ensure the effectiveness of the processes used in a business production environment. This takes time and prior software and coding skill, so highly skilled developers are needed to leverage analytics solutions or machine learning models. The proposed system gives the user a web-based visual data analytics and machine learning procedure designer with an interactive interface. Users simply drag and drop the objects or components needed for a procedure, such as inputs, outputs, or popular machine learning algorithms. Regardless of the library chosen, the proposed back-end system runs optimized services and provides independent URLs. With this visual interaction, results are displayed dynamically without requiring any code from the user.
  5. 5. v Acknowledgments The authors are deeply indebted to Professor Charles Zhang for his invaluable comments and assistance in the preparation of this study.
  6. 6. vi Table of Contents
Chapter 1. Project Overview .......... 1
  Introduction .......... 1
  Proposed Areas of Study and Academic Contribution .......... 3
  Current State of the Art .......... 5
Chapter 2. Project Architecture .......... 8
  Introduction .......... 8
  Architecture Subsystems .......... 9
    1) Visual Analytics GUI web program .......... 9
    2) Docker-based cloud backend service .......... 10
Chapter 3. Technology Descriptions .......... 12
  Drag and Drop .......... 12
  Authentication .......... 12
  File Upload .......... 12
  File Download .......... 13
  Backend Routing .......... 14
  Database Connections .......... 15
  Distributed Processing Frameworks .......... 15
Chapter 4. Project Design .......... 17
  GUI Design (View) .......... 17
  Controller Architecture .......... 26
  Authentication Procedures .......... 29
  Frontend Router Definitions .......... 30
  Backend Router Definitions .......... 30
  Database Schema .......... 34
  Python Services Interface to Frameworks .......... 35
Chapter 5. Project Implementation .......... 37
  Authentication .......... 37
  Visualization Tools .......... 40
  Backend .......... 48
Chapter 6. Testing and Verification .......... 50
  Unit Test .......... 50
    UI Testing .......... 50
    Load Testing .......... 51
    Performance Testing .......... 52
    Authentication Test .......... 52
    Data Transaction via Router Test .......... 53
  Integrated Test .......... 53
  7. 7. vii
    Test Scenario .......... 53
    Test Result .......... 55
Chapter 7. Performance and Benchmarks .......... 56
Chapter 8. Deployment .......... 58
  Deployment plan .......... 58
Chapter 9. Summary, Conclusions, and Recommendations .......... 60
  Summary .......... 60
  Recommendations for Further Research .......... 61
Glossary .......... 63
References .......... 65
Appendices .......... 66
  Appendix A. .......... 66
  8. 8. viii List of Figures
Figure 1. Project Architecture .......... 8
Figure 2. Login Template .......... 17
Figure 3. Login Template .......... 18
Figure 4. Main Visualized Tool .......... 19
Figure 5. File Uploads & Preview .......... 20
Figure 6. Manipulating the File to Be Uploaded .......... 21
Figure 7. Add New Row GUI .......... 21
Figure 8. Drag and Drop .......... 22
Figure 9. Draw the Lines Between Nodes .......... 23
Figure 10. Dialog to Look at and Modify the Input File .......... 24
Figure 11. Dialog to Make User's Custom Script (e.g. Filter) .......... 25
Figure 12. Dialog for Output .......... 25
Figure 13. Facebook OAuth .......... 29
Figure 14. Sign-In .......... 29
Figure 15. Database Schema .......... 34
Figure 16. Chrome's Local Storage Content .......... 40
Figure 17. Test Scenario .......... 54
Figure 18. Deployment Plan .......... 58
  9. 9. ix List of Tables
Table 1. Frontend Router Definitions .......... 30
Table 2. Backend Router Definitions .......... 30
Table 3. Part of Facebook Login Function .......... 38
Table 4. Part of Sign Up Function .......... 39
Table 5. Part of Navigation DOM Source Code .......... 41
Table 6. Part of Navigation Controller Source Code .......... 44
Table 7. Part of Main Controller Source Code (Link Creation) .......... 44
Table 8. Part of Main Controller Source Code (Load LocalStorage) .......... 44
Table 9. UI Testing .......... 51
Table 10. Test Input .......... 55
Table 11. Performance and Benchmarks .......... 57
Table 12. Glossary .......... 64
  10. 10. 1 Chapter 1. Project Overview
Introduction
Users (developers, statistical researchers, analytics enthusiasts) who want to apply analytics or machine learning to datasets typically have to follow a lengthy process. They must install the required libraries and configure their systems with the analytics and machine learning libraries and their dependencies, which consumes considerable time and resources. These libraries usually come in the form of frameworks that require extensive environment setup and dependency installation. They rarely have an easy-to-follow installation process, and the process often differs depending on the existing environment of the developer's machine or cluster. Users must then analyze the dataset manually or set up yet another tool to do so. To use data for machine learning, preprocessing is an essential step for accurate results: the dataset must be inspected and appropriate steps taken. There are many tools for analyzing data with tabular visualization and filters applied to the data, but analyzing data with an analytics tool, or applying machine learning to it, often requires a complete understanding of the tool or algorithm being used. Different analytics and machine learning frameworks use different data platforms, types, and formats, and different tools and wrappers use different frameworks under the hood. It involves coding in the language supported by the tool or library, and it takes relevant skill in the tool before development can begin and results can be seen. If the user is not satisfied with a library's or algorithm's accuracy or results, the user must repeat similar steps with a different library or tool. This process never ends until the user is happy with the result.
  11. 11. 2 To top it off, if the user is a machine learning researcher, data scientist, or data engineer, this process truly never ends, because part of their job is to try different frameworks, wrappers, and platforms to make their application the best and most efficient in accuracy and performance. The outcome of this project is a highly scalable SaaS (Software as a Service) web application that provides a flow editor for connecting different nodes. The nodes include input nodes (data sources in formats such as JSON and CSV), output nodes (visualization nodes, output data formats, data layouts), and other nodes wrapping machine learning algorithms from various libraries (Apache Spark, TensorFlow, etc.). With this GUI-based web application, users need not install any software or client application. Another advantage is that the user does not need heavy processing power or large on-premise memory to solve big data problems: being cloud software, the solution handles this constraint by being elastic and using deployed cloud clusters as its backend backbone. The SaaS application can be accessed anywhere via the internet. The application is intuitive and easy to use: users interact with it through drag and drop, and each node has its own configuration, editable through a human-readable form interface, giving the user freedom to tweak the process and node behavior. There is no need to code; the connectivity between nodes is converted into the appropriate data modeling or code on the backend, leaving the user free to focus on the analytics and results. This system reduces development time and resources and increases
  12. 12. 3 productivity. Since the system contains many libraries, users can choose algorithms from any available library and quickly compare the results of one library against another without knowing the library internals or writing code. Through this process, a user can import a data source file and analyze its data with visualization charts and analytics with ease. The project uses only open-source libraries and software to build a non-commercial, low-cost system, and the system itself can be made available as an open-source project.
Proposed Areas of Study and Academic Contribution
In the field of Computer Engineering, data has always been a priority; the quality of prediction results is directly proportional to the accuracy of the data, and the list of reasons data matters is endless. In earlier days, collecting accurate data was a tedious process, done through surveys, forms, bulk experiments, and other cumbersome methods. The software tools once used to visualize and manage data can no longer process huge data volumes effectively, which has become a serious challenge for companies that must process huge data effectively in a short period of time (Nasridinov et al., 2013) [1]. With the evolution of the computer industry, the sources of data evolved rapidly as well, and many companies now use machine learning algorithms to solve complex problems. Problems related to big data can be solved using visualization
  13. 13. 4 and data analytics, and the use of adequate, well-chosen tools helps extract useful information from the data (Keim et al., 2013) [2]. Visualization can also be extended to spaces called hybrid-reality environments, which combine the capabilities of virtual reality with high-resolution LCD walls (Reda, 2013) [3]. Visual analytics is growing rapidly in industry today; it combines the visualization of huge data with analytical reasoning, helping us visualize large volumes of data and gain useful insights from it (Streit et al., 2013) [4]. Many large-scale scientific computing projects share common data archives, modeling tasks, and the like, but the lack of standardization and the variety of data formats make collaboration difficult (Cheng et al., 2006) [5], and the cost of duplicate implementations is also rising rapidly. With the rapid growth of data, the use of machine learning models is increasing, but using them requires prior, in-depth understanding. The demand for predictive models that analyze huge volumes of data for useful insights keeps rising; for example, a healthcare institution can apply a predictive model to its patient database and give priority to high-risk patients. Building such complex models is difficult, and data scientists often turn to machine learning to build them (Krause et al., 2015) [6]. So, there is a need for a web-based visual machine learning procedure designer with an interactive GUI, where users can simply drag and drop the components, such as inputs
  14. 14. 5 or machine learning algorithms, that are needed for a procedure; thanks to the visual interaction, the results are displayed dynamically without any code from the user.
Current State of the Art
The data extracted from a dataset can be visualized using the intuitive web application built in this project. The drag-and-drop connectivity of components (input sources, output sinks, algorithms, etc.) lets the user build a data model for analyzing data. The web application is inspired by Node-RED, an open-source application for connecting IoT devices and data sources and redirecting output to different destinations. This familiar interaction style makes the application easy to pick up. The user connects the available nodes sequentially to form a model; the model drawn on the GUI is converted into code on the backend system to execute the analytics the user has chosen. Data transformation functions such as map, reduce, and filter are available as service nodes to connect to data nodes, and the user can also leverage machine learning algorithms from different libraries. The supported machine learning algorithms are discussed below. Several open-source libraries provide many machine learning algorithms, namely Apache Spark, TensorFlow, and scikit-learn. Apache Spark is a top-level Apache project; Spark is used for data transformation, and its MLlib library provides machine learning algorithms. TensorFlow is an open-source product from Google, developed mainly for image recognition and video-based data
  15. 15. 6 sources; for such workloads, GPU processing is far more important than CPU processing. Scikit-learn is an open-source library for Python programmers that provides a wide range of machine learning algorithms. To interact with these frameworks in the backend infrastructure, Python micro-services built on the Bottle framework ecosystem will be available. There will be separate micro-services for data upload, data download, user account and profile management, and data and node management. The most important of all, however, is the processing service, which takes the ready node chains as input and processes the specified dataset using the nodes in the chain. These nodes are predefined execution logics in the service, which can take different user-supplied parameters as input and process the given dataset accordingly. The dataset also appears to the user as a node and must be uploaded by the user into his or her account beforehand in order to process it. This makes the design of the processing service very flexible and adaptable. If a user needs to add custom logic, he or she can use the provided custom node, or the map, reduce, and filter nodes, to write logic to be executed. Akka is used for big data batch processing, Docker for the micro-services, and Kubernetes to manage the Docker containers. WebSockets (socket.io) are used for communication between the Visual Analytics tool (web GUI) and the backend cloud service. A NoSQL database (MongoDB) stores users' machine learning logics and supports the data batch processing. The web GUI is developed using the MEAN stack (MongoDB, Express.js, Node.js, and Angular.js). Ajax and Angular.js
  16. 16. 7 will be used for extensive client-side scripting to improve the user experience of the web application. Several analytics tools, such as the R language used by statistical researchers, require extensive programming; without code, the analytics simply won't work. This GUI-based application instead lets users upload their dataset as an input data node, apply analytics functions to filter the data, and view various visualization models against it.
  17. 17. 8 Chapter 2. Project Architecture
Introduction
The architectural design is conceived as follows. Communication between the backend and frontend is done through WebSockets: via Node.js and socket.io on the frontend, and via WebSocket (WS) in Akka HTTP on the backend. In the GUI, when the user's drag-and-drop composition is complete and the machine learning logic has taken shape, that logic is converted into a script upon execution. The transformed logic is packetized and sent to the backend server via socket.io in JSON format, which returns the result.
Figure 1. Project Architecture
The server is mainly composed of Akka HTTP and Akka Actor. All Akka components run on the Play Framework, which can be dropped later if it is not needed.
  18. 18. 9 Akka HTTP exists to connect all the layers, and our program uses it to provide WebSockets. All WebSocket requests are passed to the Akka Actor, which acts as a message queue within our program. The Akka Actor calls the corresponding handler sequentially according to the state and behavior of each request. For example, when the user triggers Execute at the GUI level, the JSON file of the script is transmitted: Akka HTTP forwards a request with ID = EXECUTE_ML carrying that JSON to the Akka Actor, which invokes the corresponding handler by ID. In our program, the ML Manager module is called to analyze the actual user request through the body and header of the request and to select the appropriate machine learning library. The machine learning libraries are all containerized in Docker; Kubernetes exposes them as a single IP through a service, and scale is managed automatically for each library worker. The status of each worker is checked periodically by the Akka Actor through the batch system, and confirmed work is delivered to the GUI (socket.io) periodically via WS. Socket.io lets the user check status and results in real time through a URL, via the corresponding handler for the received response.
Architecture Subsystems
Our program has a total of two deliverables, and the architecture of each is as follows.
1) Visual Analytics GUI web program
The program consists of Node.js and Angular.js and follows AngularJS's MVVM architecture. The program itself consists of services, controllers, factories, and views, and manages the routing of each service in the main script. The service layer manages session
  19. 19. 10 and web storage, and exchanges Ajax calls with the main app. The controllers manage the view models, which are two-way bound directly, and bind the logic that handles user actions (e.g., drag and drop). All endpoints are managed in a single file through our own built-in RESTful factory, which separates routing handled on the front end from routing handled in Node.js on the back end. For example, example.com/dashboard is managed directly by AngularJS, whereas a route like example.com/api/:user/:input (a colon denotes a URL parameter) is managed through Express in Node.js. At the back end, data is managed through MongoDB, and user login sessions are cached and managed through Redis. In addition, when push notifications are needed beyond Ajax, a channel can be created through socket.io to enable real-time processing.
2) Docker-based cloud backend service
To implement the logic created in (1), we build a back-end service using an Actor model based on the Play Framework, Scala, and Akka. We use a big data batch processing model and design three Akka services:
A. A service that hooks the user's stored analytics and ML logic
B. A service that runs the logic hooked in A
C. A service that periodically reports the processing status of B and the hashed URL
As a basic program scenario, once the visual analytics design is done, it is replicated to the master MongoDB database; the Akka-based batch processing service polls it every 5 seconds, and the cloud service runs it sequentially using the customized
  20. 20. 11 Spark and TensorFlow libraries. Each run goes through its own container, with CPU share and the like calculated automatically and auto-scaling handled through Kubernetes. The machine learning logic is made accessible to the user via example.com/{hash value}. When the corresponding URL is generated, the program from (1) automatically notifies the user through the WebSocket, via service C. The user can access the job through the URL with the corresponding hash value and see the processing status (preparation, running, error, completion). When processing is complete, the user additionally receives the result URL.
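The report does not spell out the exact payloads exchanged over this socket.io channel, so the sketch below only illustrates plausible shapes for the execution request (ID = EXECUTE_ML) and the periodic status update described above; every field name here is an assumption, not the project's actual protocol.

# Hypothetical payload shapes for the socket.io channel (field names assumed).
execute_request = {
    "id": "EXECUTE_ML",                        # request ID dispatched by the Akka Actor
    "user_id": 42,
    "script": {"operators": {}, "links": {}},  # flowchart JSON drawn in the GUI
}

status_update = {
    "job_id": "a3f9c1",                        # hash value embedded in the result URL
    "status": "running",                       # preparation | running | error | completion
    "result_url": "example.com/a3f9c1",
}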
  21. 21. 12 Chapter 3. Technology Descriptions
Drag and Drop
We implemented the drag-and-drop feature in the UI for file uploading using DropzoneJS, an open-source library that provides drag-and-drop file uploads with image previews. When a user wants to upload a dataset (in .csv or a similar format), he or she can either drag and drop the files into the predefined area or select and upload them. We used the Node.js-with-Express variant of DropzoneJS for the implementation.
Authentication
Our program uses a web token for user login and session management. MySQL, connected through Node.js, manages the user DB. To communicate with Node.js, the web token is carried in the HTTP header for authentication processing (e.g., CSRF protection) during basic UI and file upload work in Node.js. There are two types of authentication systems: 1) Facebook Login: after registering our application as a third-party application on Facebook, users can log in with their Facebook account. 2) Database Login: users can sign up with a simple form and log in with their e-mail and password.
File Upload
Our program provides a file upload system for the user's input and output files. We use the Dropzone.js library, which allows the user to upload CSV and PSV files by drag
  22. 22. 13 and drop. Afterwards, the user can inspect the rows and columns of the file so that it can be used as input to our system. The user must upload the dataset as a file into his or her account before using it as input to any process chain. The uploaded file goes from the web client to the file storage of the Python server, where it is kept on the HDFS store. The mapping of the filename and root path to the file IDs is stored in the database. Whenever the user processes a dataset with a node chain, a new dataset/result is generated and saved to a separate file with a "processed" suffix added to the original filename; this creates a separate database entry identifying the result as its own usable, downloadable file. Our system shows a preview of the file size when the user finishes the drag and drop, and after uploading, the user can find the file by name in the sidebar and use it as an input node by dragging and dropping.
File Download
File downloads are handled by a simple REST request to the Python backend services. A list of all available datasets (whether an original dataset or a processed dataset from a past node chain run) is available to the user in his or her account, each with an access link to download the file. The link downloads the dataset as a file onto the user's machine: the REST call sends the file as a FileIO object to force it to be downloaded rather than shown or previewed in the browser.
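As a concrete illustration of this forced-download behavior, a minimal sketch of such an endpoint follows, written with the Bottle framework named in Chapter 1; the lookup helper and naming are assumptions for illustration, not the project's actual code.

# Minimal sketch of a forced-download endpoint, assuming Bottle.
from bottle import route, static_file

@route('/download/<dataset_id>')
def download(dataset_id):
    # lookup_file is a hypothetical helper resolving the database mapping
    # from dataset ID to the stored root path and filename.
    root, name = lookup_file(dataset_id)
    # Passing download=... sets Content-Disposition so the browser saves
    # the file instead of previewing it.
    return static_file(name, root=root, download=name)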
  23. 23. 14 Backend Routing
Our program supports sending the user's data wrangling and machine learning logic to the backend system along with the input file. The backend is therefore split into several routes, so that HTTP communication can actually send and receive files and run the machinery. The routing is divided into two parts with different sets of operations. One is the Node.js routing module, which sits on the client side and handles the routing of HTTP REST requests from the web client. It receives the node chain and other inputs from the client and directs each request to the right backend service. It is also responsible for communicating with the database and providing business logic in cases where data processing and interaction with the analytics frameworks are not needed; these are mostly CRUD operations against the database. The other part is the Python services, which primarily handle data processing and node logic execution. This code executes the data processing logic that the node chain describes in the input. It interacts with the data analytics and machine learning frameworks installed in the backend and provides modular execution of the nodes needed for the user's logic, and it also offers a way to save the user's logic as a node of its own for future use. To use the map, reduce, and filter nodes alongside the other built-in nodes, the user is prompted for the code to run, passed as an input parameter to map, reduce, and filter. In the other built-in nodes, the parameters merely control and modify node behavior slightly, and they
  24. 24. 15 are expected as a JSON object containing all the parameters. They dictate the processing logic of the node and produce the desired output for any built-in or custom node the user employs.
Database Connections
Our program uses a DBMS for user session management and for tracking processed input files. At this point, MySQL is used to store user information and the user's custom-designed logics. Both backend modules connect to the database, but for different purposes. The routing module's work is mostly to perform CRUD operations on the application's various entities, so it makes heavy use of the database connection. The Python services, on the other hand, process the data and generate the output dataset by executing the logic on the input data; they need the database connection only for resolving the file path and name mapping from the file ID, and for adding new entries to the dataset entity when a new processed dataset is generated. This module has very limited database connectivity, and most of its work is heavy processing independent of the database itself.
Distributed Processing Frameworks
The frameworks we chose as the distributed processing core for data analytics and machine learning are Apache Spark and TensorFlow, the most popular options for their respective tasks. Apache Spark, for distributed data processing for analysis and Extract, Transform, and Load operations, is an industry standard that is widely used, deployed, and researched. Also, for deploying flexible
  25. 25. 16 and robust machine learning modules, TensorFlow from Google is a well-known product that provides a great deal of freedom in designing machine learning models. These frameworks are installed and configured into the environment on an image that is used in the cloud as the base machine image for new elastic instances.
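As an illustration of how the Python service might attach to the Spark runtime baked into that machine image, consider the following minimal sketch; the application name, master URL, and file path are assumptions, not values from the report.

# Minimal PySpark bootstrap sketch; configuration values are assumed.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("visual-analytics-service")   # hypothetical app name
        .setMaster("spark://spark-master:7077"))  # assumed cluster URL
sc = SparkContext(conf=conf)

# An uploaded dataset stored on HDFS is read as an RDD of CSV rows.
rows = sc.textFile("hdfs:///datasets/example.csv").map(lambda l: l.split(","))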
  26. 26. 17 Chapter 4. Project Design
GUI Design (View)
Figure 2. Login Template
The main screen starts with login, because our program preserves each user's independent data and provides a login for this purpose. It also offers Facebook login so users can log in conveniently.
  27. 27. 18 Figure 3. Login Template
Users can also sign up by providing simple information. All fields must be filled in, and IDs are automatically checked for duplication.
  28. 28. 19 Figure 4. Main Visualized Tool
The screen above is the main screen of our program. There is a menu on the left and a canvas where the user can drag and drop. After login, basic user information is displayed in the upper left corner, and buttons for uploading, downloading, and checking results sit below it.
  29. 29. 20 Every data flow previously drawn by the user is loaded automatically. As the user drags and drops, every node, flow, and its data is automatically saved both in the database and in the web browser.
Figure 5. File Uploads & Preview
The Upload button takes the user to the upload screen. Our program uses an upload library called Dropzone.js: files can also be dragged and dropped, and the file size and upload status are visible while a file uploads. At present users can only upload CSV and PSV files; XML and XLS support will be added later. There are two buttons in the upload flow: 1) Upload: upload multiple files to a temporary folder. 2) Next: once the upload is done, the user sees a preview of the file and can add, modify, or remove each row as needed.
  30. 30. 21 Figure 6. Manipulating the File to Be Uploaded
Once the file is uploaded as a temporary file, the user is given a simple tool to edit its rows and columns, which allows the user to delete unwanted rows and columns.
Figure 7. Add New Row GUI
  31. 31. 22 The user can add a custom row via the [Add New Row] button and write its contents, and can also modify or remove the current row using the edit and remove buttons at the far right.
Figure 8. Drag and Drop
Drag and drop is the most important feature of our program: the user drags the necessary resources from the menu on the left. User-uploaded files appear below the Input menu. In addition, the user can specify the format of the output file (JSON, CSV, XML, etc.), along with functions such as Filter, Reduce, and Map.
  32. 32. 23 Figure 9. Draw the Lines Between Nodes
The user defines connections through the circles below each dragged node. The lower circles define the output, and an output can only feed the input of another node; the figure above shows the format by example. After wiring the nodes this way, detailed attributes can be set by double-clicking each node, and the corresponding logic is executed via the Run button.
  33. 33. 24 Figure 10. Dialog to Look at and Modify the Input File
The dataset (input file) can be modified through a dialog. When the user double-clicks the dataset node, exactly the same screen as the upload preview appears, where the user can add new rows and delete or modify any row as needed.
  34. 34. 25 Figure 11. Dialog to Make User's Custom Script (e.g. Filter)
The user can write his or her own logic here. Appendix A contains screenshots and field definitions for every node.
Figure 12. Dialog for Output
  35. 35. 26 The user also sets the output format and its properties; we currently have two fields, isSorted and Limit.
Controller Architecture
The controllers used in our application:
1. LoginCtrl
The login controller runs when the user initially logs into the application with email ID and password; the user can also log in via Facebook. If the user is not authorized, a "not_authorized" alert is shown, and any other login error raises an "error" alert.
2. RegisterCtrl
  36. 36. 27 The register controller runs when the user wants to create a new account; details such as first name, last name, email, and password must be entered. If any field is left empty, an alert such as "First Name is empty!" is shown.
3. AppCtrl
This controller backs the footer and navigator pages. The navigator page contains buttons such as Upload, Run, and Result, where the user can upload desired files, run them, and obtain the result.
4. MainCtrl
This is the main controller of the visualAnalyticsApp; it draws the actual flowchart and acts on it. There are five major actions in our program: 1) Add a flowchart: initialize and define the actual flowchart object. 2) Add a node to the flowchart: responsible for adding a node and handling double clicks. Each time a node is added, the script in the global variable is updated and the corresponding JSON is sent to the server so that the latest node structure is maintained for each user. 3) Select and delete nodes and flows: the selected node or flow line can be deleted. 4) Double-click a node: when a node is double-clicked, its attributes are displayed in a modal dialog. The attributes are broadly divided into input, tool, and output. 5) Flowchart reset: clear all nodes and lines and initialize the data.
5. UploadCtrl
  37. 37. 28 This controller uploads files to the server. The user can upload multiple files, with configurable limits on file size and file count; if a file exceeds the set size or does not match the desired file type, it is not uploaded.
6. UploadPreviewCtrl
After uploading files, the user can preview and edit them through this controller; with multiple files, each can be previewed and edited in turn. A pagination feature loads only 100 rows at a time for faster loading. Once editing is done, the changes are saved into the files, which the app then uses for processing.
7. ResultCtrl
This controller provides the user's job history, showing the job_id and status along with the corresponding result_url.
Users can log in through their email address or Facebook account. Login is required so that the machine learning logic the user has drawn can be loaded: after login, the user data stored in MySQL linked to Node.js is loaded.
  38. 38. 29 Authentication Procedures
Figure 13. Facebook OAuth
Figure 14. Sign-In
We use Facebook's standard authentication procedure. Once the user successfully logs in via
  39. 39. 30 Facebook, the user receives a JWT (JSON Web Token) shared with the server; every HTTP request must carry the JWT to maintain the login session.
Frontend Router Definitions
The frontend exposes the following routing functions:

POST /users/addUser - Register a new user with the given parameters; checks whether the user already exists.
GET /users/login - Perform login; returns the user's entire record as JSON.
GET /users/FBlogin - Perform login with Facebook; returns the user's entire record as JSON.
GET /users/updateScript - Update the user's script.
GET /users/userHistory/:seq - Get the user's job history by user_seq.
POST /files/file_upload - Upload a file.
GET /files/get_file_list - Get the file list.
GET /file/get_file/:file_name - Get one file.

Table 1. Frontend Router Definitions

Backend Router Definitions
The backend has three routing functions:

/process/<dataset_id> - Depending on dataset_id, our Apache Spark instance runs that dataset with the user's specific requests, such as map, reduce, and filter.
/upload - Supports uploading an input file.
/download/<dataset_id> - Downloads the Apache Spark result for dataset_id.

Table 2. Backend Router Definitions

The backend router connects the user actions (drag and drop of the input dataset node; Spark algorithms such as Map, Reduce, Split, and Filter; and the output node (CSV, JSON, etc.)) to the backend Python services. The backend router collects the data from the front-end draggable
  40. 40. 31 interface and assembles the data required for the request to the Python API. The Python API then walks through the received HTTP request and runs the file through the respective Spark nodes to generate a processed file.
HTTP Router Request: The code below shows the HTTP request in Angular.js:

$http({
  method : "POST",
  url : 'http://localhost:8080/process',
  data : {
    "dataset_id" : dataset_id,
    "node_chain": jsonFile,
    "output" : output_options
  }
}).then((results) => {
  alert(results)
}, (error) => {
  console.log("Error", error);
});

The Python API HTTP request: the request has 3 important parameters.
  41. 41. 32 1. dataset_id: a unique dataset ID identifying the input node the user wants to process.
2. node_chain: an array of JSON objects specifying the list of algorithms the user has dragged in to make the model. These algorithms are executed sequentially, and the final processed file is collected. Sample JSON:
  42. 42. 33
node_chain = [
  {
    "node": "map",
    "logic": <User Logic>
  },
  {
    "node": "reduce",
    "logic": <User Logic>
  },
  {
    "node": "filter",
    "logic": <User Logic>
  },
  {
    "node": "extractUsingRegex",
    "params": {
      "regex": <User Written regex>,
      "column": 0
    }
  }
]

3. output: the output has 3 sub-parameters:
- Output file format (CSV, JSON, etc.)
- isSorted (whether the file needs to be sorted after being processed)
- Limit (row limit after processing)
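On the Python side, a route receiving this request might unpack the three parameters as in the sketch below, following the Angular request above (Bottle-style; execute_chain is a hypothetical helper, not the project's actual function).

# Sketch of the /process endpoint unpacking the request above (Bottle).
from bottle import post, request

@post('/process')
def process():
    body = request.json
    dataset_id = body["dataset_id"]
    node_chain = body["node_chain"]   # ordered list of node descriptions
    output = body["output"]           # format, isSorted, limit
    # execute_chain is a hypothetical helper that runs the Spark job
    # and returns the handle to the processed result.
    result_url = execute_chain(dataset_id, node_chain, output)
    return {"status": "running", "result_url": result_url}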
  43. 43. 34 Database Schema
The tables below are used in the database design. The schema diagram shows the tables that store user information, user transactions, and file URLs, and how they link together.
Figure 15. Database Schema
  44. 44. 35 Python Services Interface to Frameworks
The Python services that act as the interface to the machine learning and data analytics frameworks are modular in design and flexible enough to be extended on the fly. Node logic is saved in separate files in the service's local storage; as soon as a node is invoked, the file corresponding to that node is imported and its logic is appended to the input dataset object. In this way a chain of operations covering all the nodes is built up, and the whole chain is then executed to render the final output of the process. The map, reduce, and filter nodes behave slightly differently from the other, regular nodes: they take input code that must be imported when the map, reduce, or filter logic is applied. This requires saving the input code into a Python module on the fly, so that the code can be written to a file and loaded instantly as a module for use in the map, reduce, and filter nodes. After the map, reduce, or filter node has been applied, the loaded module is considered obsolete and its file is deleted from disk. The custom node is another exception in node behavior: the user enters custom code in the web UI to be executed for the node, and can choose to name the node explicitly and save its logic under his or her account history, making the node accessible at any time in the future. If the user decides not to name the logic and leaves it anonymous, as with anonymous functions in most programming languages,
  45. 45. 36 the logic is saved to a temporary file, deleted after serving its purpose, and cannot be reused in later executions. The nodes come in three types: input nodes, output nodes, and processing nodes. The input nodes read the input from file, using the database mapping of file name, path, and ID; they serve as the starting points of a node chain. The processing nodes append their node logic to the input dataset object, to be executed when output is requested from it. The output nodes execute the appended node logic on the dataset to obtain the output, which is saved to a file, shown on the web portal, or downloaded directly, according to the options and parameters the user selected.
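A minimal sketch of the save-import-delete cycle described above might look as follows; the module layout and helper names are assumptions, since the report does not show the actual implementation.

# Sketch: persist user code as a throwaway module, import it, then delete it.
import importlib
import os

def load_user_logic(node_id, user_code):
    # Assumes a 'nodes' package (with __init__.py) on the import path.
    path = os.path.join("nodes", "tmp_%s.py" % node_id)
    with open(path, "w") as f:
        f.write(user_code)                  # save the user's code on the fly
    module = importlib.import_module("nodes.tmp_%s" % node_id)
    return module, path

module, path = load_user_logic("filter_1", "def logic(row):\n    return row[0] != ''")
# ... module.logic is handed to the filter node here ...
os.remove(path)                             # anonymous logic is discarded after use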
  46. 46. 37 Chapter 5. Project Implementation
Authentication
There are two types of authentication systems: 1) Facebook Login: after registering our application as a third-party application on Facebook, users can log in with their Facebook account. Since every Facebook user has a unique key value, on login we store that key and the user's information in the browser's cookie and in our DB. 2) Database Login: users can sign up with a simple form and log in with their e-mail and password. In this case, the password is hashed with the Bcrypt library using a unique salt value.

FB.login(function (response) {
  // handle the response
  if (response.status === 'connected') {
    // Logged into your app and Facebook.
    FB.api('/me?fields=email,first_name,last_name,link', function (fb_userinfo) {
      const { email_id = "", password = "", first_name = "", last_name = "", id: user_name } = fb_userinfo;
      $scope.userinfo = { user_name, email_id, password, first_name, last_name };
      $http.post("user/FBlogin", $scope.userinfo)
        .then(response => {
          localStorage.setItem("__USER_INFO__", JSON.stringify(response.data)); // Save User Info into LocalStorage
          localStorage.setItem("__USER_SCRIPT__", response.data.script);
          $state.go('index.main');
        }).catch(error => {
          alert(error.data);
        })
    });
  } else if (response.status === 'not_authorized') {
    alert("not_authorized");
  } else {
    alert("error");
  }
}, {scope: 'public_profile,email,'});

Table 3. Part of Facebook Login Function

Users can also sign up by providing simple information. All fields must be filled in, and IDs are automatically checked for duplication.

$scope.register = function() {
  const {first_name, last_name, user_name, password, email_id} = $scope.userinfo;
  if(password !== $scope.password2){
    alert("Password does not match!");
    return;
  }
  if(first_name === ""){
    alert("First Name is empty!");
    return;
  }
  if(last_name === ""){
    alert("Last Name is empty!");
    return;
  }
  if(password === ""){
    alert("password is empty!");
    return;
  }
  if(email_id === ""){
    alert("Email is empty!");
    return;
  }
  $http({
    url: "user/addUser",
    method: "POST",
    dataType: 'json',
    data: $scope.userinfo
  }).then(function successCallback(response) {
    // this callback will be called asynchronously
    // when the response is available
    //console.log(response.data)
    var __USER_INFO__ = response.data;
    localStorage.setItem("__USER_INFO__", JSON.stringify(__USER_INFO__));
    $state.go('index.main');
  }).catch((error) => {
    alert(error);
  });
}

Table 4. Part of Sign Up Function

After the user logs in, the user information is stored in LocalStorage in JSON format as described above, and the information the GUI needs can be fetched through the cookie. The logged-in user can then invoke his or her existing machine learning logic.
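The project performs this hashing in the Node.js layer with the Bcrypt library; as a language-neutral sketch of the same salt-hash-verify flow, shown here with Python's bcrypt package rather than the project's actual code:

# Sketch of the bcrypt salt-hash-verify flow (illustrative only).
import bcrypt

password = b"user-chosen-password"
# Registration: gensalt() produces a unique salt per user; the salt is
# embedded in the stored hash.
hashed = bcrypt.hashpw(password, bcrypt.gensalt())

# Login: compare the submitted password against the stored hash,
# then issue the JWT for the session on success.
assert bcrypt.checkpw(password, hashed)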
  49. 49. 40 Figure 16. Chrome's Local Storage Content
After logging in, the web token is stored in the browser's cookie so that the session can be maintained. The GUI periodically pings the Node server; if the token is invalid, the cookie is cleared and the user is automatically logged out.
Visualization Tools
After analyzing Node-RED, our benchmark, we found that it is built on HTML5's Canvas, so we decided to use jquery.flowchart, a library that uses Canvas. jquery.flowchart supports basic drawing of nodes and their connections as a graph, and it supports drag and drop, which we use for moving and connecting each drawn node. The drag-and-drop linkage between the HTML DOM and the Canvas is implemented using the Draggable function of jQuery UI. The overall UI uses Bootstrap, with MVVM management via AngularJS and front-end server management via Node.js and Express.js.
Drag and Drop
Drag and drop uses jQuery UI functionality. The left menu is defined in the navigation.html file, and each menu entry is bound to data-* attributes defining title, index, and so on.
<!-- START Tree Sidebar Common -->
  50. 50. 41
<ul class="side-menu">
  <li class="primary-submenu draggable_operator" data-nb-inputs="1" data-nb-outputs="1" data-title="map" data-idx="1" data-mode="tool">
    <a href>
      <div>
        <div class="nav-label" style="z-index:10000;">Map</div>
      </div>
    </a>
  </li>
  <li class="primary-submenu draggable_operator" data-nb-inputs="1" data-nb-outputs="1" data-title="reduce" data-idx="2" data-mode="tool">
    <a href>
      <span class="nav-label">Reduce</span>
    </a>
  </li>
  <li class="primary-submenu draggable_operator" data-nb-inputs="1" data-nb-outputs="1" data-title="filter" data-idx="3" data-mode="tool">
    <a href>
      <span class="nav-label">Filter</span>
    </a>
  </li>
</ul>

Table 5. Part of Navigation DOM Source Code

Every draggable node carries the draggable_operator class, which is made draggable through the code in nav.js. The code below does the following: 1) reads the file list from the server; 2) dynamically attaches the entries to the DOM (so the file list is draggable and droppable as well); 3) enables drag and drop together with all the other menus (handled by $draggableOperators.draggable()); 4) maps the draggable objects bound in (3) to getOperatorData(). In getOperatorData(), we build an object for the actual future data processing from data such as the data-* attributes on the DOM tag.

$http({
  url: "file/get_file_list",
  method: "GET"
}).then(function successCallback(response) {
  // this callback will be called asynchronously
  // when the response is available
  //console.log(response.data)
  $scope.fileList = response.data;
  console.log(!lodash.isEmpty(response.data))
  if(!lodash.isEmpty(response.data)){
    response.data.forEach((file, idx) => {
      $("#inputFileList")
        .append(`<li class="primary-submenu draggable_operator" data-nb-inputs="0" data-nb-outputs="1" data-title="${file.name}" data-dataset_id="${file.dataset_id}" data-idx="${idx + 7}" data-mode="input" ><a href="#">
          <div>
            <div class="nav-label" style="z-index:10000;">${file.name}</div>
          </div>
        </a>
      </li>`)
    });
  }

  var $draggableOperators = $('.draggable_operator');
  console.log($draggableOperators);

  function getOperatorData($element) {
    var nbInputs = parseInt($element.data('nb-inputs'));
    var nbOutputs = parseInt($element.data('nb-outputs'));
    var dataset_id = ($element.data('mode') === 'input' ? $element.data('dataset_id') : 0);
    var data = {
      properties: {
        title: $element.data('title'),
        inputs: {},
        outputs: {},
        dataset_id,
        mode: $element.data('mode')
      }
    };
    var i = 0;
    for (i = 0; i < nbInputs; i++) {
      data.properties.inputs['input_' + i] = { label: 'Input ' + (i + 1) };
    }
    for (i = 0; i < nbOutputs; i++) {
      data.properties.outputs['output_' + i] = { label: 'Output ' + (i + 1) };
    }
    console.log(data);
    return data;
  }

  var operatorId = 0;

  $draggableOperators.draggable({
    cursor: "move",
    opacity: 0.7,
    helper: 'clone',
    appendTo: 'body',
    zIndex: 1000,
    helper: function(e) {
      var $this = $(this);
      var data = getOperatorData($this);
      return $rootScope.flowchart.flowchart('getOperatorElement', data);
    },
    stop: function(e, ui) {
      var $this = $(this);
      var elOffset = ui.offset;
      var $container = $rootScope.flowchart.parent();
      var containerOffset = $container.offset();
      if (elOffset.left > containerOffset.left &&
          elOffset.top > containerOffset.top &&
          elOffset.left < containerOffset.left + $container.width() &&
          elOffset.top < containerOffset.top + $container.height()) {
        var flowchartOffset = $rootScope.flowchart.offset();
        var relativeLeft = elOffset.left - flowchartOffset.left;
        var relativeTop = elOffset.top - flowchartOffset.top;
        var positionRatio = $rootScope.flowchart.flowchart('getPositionRatio');
        relativeLeft /= positionRatio;
        relativeTop /= positionRatio;
        var data = getOperatorData($this);
        data.left = relativeLeft;
        data.top = relativeTop;
        $rootScope.flowchart.flowchart('addOperator', data);
        var data2 = $rootScope.flowchart.flowchart('getData');
        $http.post('user/updateScript', {
          script: JSON.stringify(data2),
          user_id: $rootScope.userinfo.user_id
        });
        localStorage.setItem("__USER_SCRIPT__", JSON.stringify(data2));
      }
    }
  });
}, function errorCallback(response) {
  // called asynchronously if an error occurs
  // or server returns response with an error status.
  console.log(response.statusText);
});

Table 6. Part of Navigation Controller Source Code

As mentioned above, MainCtrl handles all actions on the canvas: dragging, dropping, selecting nodes, connecting nodes, and so on. The most important thing in MainCtrl is that every action calls /user/updateScript to keep the latest edge and node information up to date.

$flowchart.on('linkCreate', function(linkId, linkData) {
  var data = $flowchart.flowchart('getData');
  $http.post('user/updateScript', {
    script: JSON.stringify(data),
    user_id: $rootScope.userinfo.user_id
  });
  $scope.script = data;
  localStorage.setItem("__USER_SCRIPT__", JSON.stringify(data));
});

Table 7. Part of Main Controller Source Code (Link Creation)

When the user logs in, the user information is kept in __USER_INFO__ in localStorage, and all flowchart data is kept in localStorage under __USER_SCRIPT__, assigned at login.

$scope.script = $.parseJSON(lodash.isEmpty(localStorage.getItem("__USER_SCRIPT__")) ? {} : localStorage.getItem("__USER_SCRIPT__"));

Table 8. Part of Main Controller Source Code (Load LocalStorage)

One of the most important interactions on the main canvas is defining a node's properties when the user double-clicks it. To handle this, we added an event to the stock jquery.flowchart. In this handler we process the various data from the user's double click as follows. All objects
  54. 54. 45 in the global scope are managed, and data.operators [operatorId] .properties is an attribute for each data to be sent to the future server, where the user-written data is located. $flowchart.on('operatorSelect2', function(el, operatorId, returnHash) { var data = $flowchart.flowchart('getData'); var title = data.operators[operatorId].properties.title.trim(); var mode = data.operators[operatorId].properties.mode; $scope.tempNode.title = title; //$("#myModal").modal('show') if(mode === "input"){ // For Dataset (Input) Node var modalInstance = $uibModal.open({ templateUrl: 'views/modal/view_input.html', controller: function ($scope, $uibModalInstance, $http) { $scope.load = function() { // Omitted. File handling is same as upload_preview }, scope: $scope, windowClass: "hmodal-success", size: 'lg' }); }else if(mode === "tool"){ // For Logic Node var modalInstance = $uibModal.open({ templateUrl: 'views/modal/view_node.html', controller: function ($scope, $uibModalInstance, $http) { $scope.map = { logic: '' } $scope.filter = { logic: '' } $scope.reduce = { logic: '' } $scope.extractUsingRegex = { regex: '', column: '',
                $scope.splitUsingRegex = { regex: '', column: '' };
                // Omitted: the remaining node types are initialized the same way.

                $scope.cancel = function () {
                    $uibModalInstance.dismiss('cancel');
                };

                $scope.submit = function() {
                    var data = $flowchart.flowchart('getData');
                    $uibModalInstance.dismiss('cancel');
                    // Copy the dialog parameters into the node that was edited.
                    switch (data.operators[operatorId].properties.title) {
                        case "map":
                            $rootScope.map = $scope.map;
                            data.operators[operatorId].properties.params = $scope.map;
                            break;
                        case "filter":
                            $rootScope.filter = $scope.filter;
                            data.operators[operatorId].properties.params = $scope.filter;
                            break;
                        case "reduce":
                            $rootScope.reduce = $scope.reduce;
                            data.operators[operatorId].properties.params = $scope.reduce;
                            break;
                        case "extractUsingRegex":
                            $rootScope.extractUsingRegex = $scope.extractUsingRegex;
                            data.operators[operatorId].properties.params = $scope.extractUsingRegex;
                            break;
                        // Omitted: the same logic applies to the other nodes.
                    }
                    console.log($rootScope, $scope, data);
                    $http.post('user/updateScript', {
                        script: JSON.stringify(data),
                        user_id: $rootScope.userinfo.user_id
                    });
                    $scope.script = data;
                    localStorage.setItem("__USER_SCRIPT__", JSON.stringify(data));
                    $flowchart.flowchart('setData', data);
                    console.log($flowchart.flowchart('getData'));
                };

                $scope.load = function() {
                    // Restore previously saved parameters into the dialog.
                    switch (data.operators[operatorId].properties.title) {
                        case "map":
                            $scope.map = { ...data.operators[operatorId].properties.params };
                            break;
                        case "filter":
                            $scope.filter = { ...data.operators[operatorId].properties.params };
                            break;
                        case "reduce":
                            $scope.reduce = { ...data.operators[operatorId].properties.params };
                            break;
                        case "extractUsingRegex":
                            $scope.extractUsingRegex = { ...data.operators[operatorId].properties.params };
                            break;
                        // Omitted: the same logic applies to the other nodes.
                    }
                };
                $scope.load();
            },
            scope: $scope,
            windowClass: "hmodal-success",
            size: 'lg'
        });
    } else if (mode === "output") {
        // Output node
        var modalInstance = $uibModal.open({
            templateUrl: 'views/modal/view_output.html',
            controller: function ($scope, $uibModalInstance, $http) {
                $scope.cancel = function () {
                    $uibModalInstance.dismiss('cancel');
                };
                $scope.submit = function() {
                    $uibModalInstance.dismiss('cancel');
                    data.operators[operatorId].properties.limit = $scope.output.limit;
                    data.operators[operatorId].properties.isSorted = $scope.output.isSorted;
                    $flowchart.flowchart('setData', data);
                    $rootScope.output = $scope.output;
                };
            },
            scope: $scope,
            windowClass: "hmodal-success",
            size: 'lg'
        });
    }
});
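For reference, the flowchart JSON that these handlers persist and post to user/updateScript has roughly the following shape. This is an illustrative sketch only: the operator ids, titles, and params below are made up, and the exact structure follows the jquery.flowchart data format.

// Hypothetical example of the script JSON built by the canvas.
var script = {
    operators: {
        0: { left: 40, top: 60, properties: { title: 'dataset', mode: 'input',
              params: { dataset: 'sample.csv' } } },
        1: { left: 220, top: 60, properties: { title: 'filter', mode: 'tool',
              params: { logic: 'length(line) > 0' } } },
        2: { left: 400, top: 60, properties: { title: 'output', mode: 'output',
              limit: 100, isSorted: false } }
    },
    links: {
        0: { fromOperator: 0, toOperator: 1 },
        1: { fromOperator: 1, toOperator: 2 }
    }
};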
Backend

The backend is implemented in Python. A lightweight Python web server, connected to Apache Spark, receives and delivers data according to the router design. When a dataset is loaded, it is converted into a Resilient Distributed Dataset (RDD) in Apache Spark, which enables parallel processing of the dataset on the Spark cluster. Each node in the chain then appends its logic as an operation on this input RDD. Because Spark evaluates transformations lazily, the RDD remains a collectible object, and the whole process chain executes only when the RDD is collected and the result is needed. Finally, at the output node, the RDD is collected and the result is saved according to the user's choice in the web portal. The output file mapping, from dataset ID to root path and file path, is saved to the database.
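The sketch below illustrates the lazy-chaining pattern the backend relies on. It is written in JavaScript purely for illustration (the real chain is built in Python on Spark RDDs), and the class and method names are ours, not Spark's API: each node queues an operation, and nothing executes until collect() is called.

// A minimal illustration of deferred execution: operations are queued
// per node and applied only when the output node collects the result.
class LazyChain {
    constructor(data) {
        this.data = data;
        this.ops = []; // queued node logic, in graph order
    }
    map(fn)    { this.ops.push((rows) => rows.map(fn));    return this; }
    filter(fn) { this.ops.push((rows) => rows.filter(fn)); return this; }
    collect()  { return this.ops.reduce((rows, op) => op(rows), this.data); }
}

// Usage mirroring a dataset -> filter -> map -> output graph.
var result = new LazyChain(['110 N PIPER ST', '', '151 N RICHARDS ST'])
    .filter((line) => line.length > 0)
    .map((line) => line.split(' ')[0])
    .collect(); // ['110', '151']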
Chapter 6. Testing and Verification

Unit Test

UI Testing

| S.No | Action Performed | Expected Result |
|------|------------------|-----------------|
| 1 | Navigate to the website using the URL | The application screen opens |
| 2 | Drag and drop the nodes representing the algorithm into the application | The corresponding nodes are selected and dropped onto the canvas |
| 3 | Drag and drop the map and reduce nodes onto the application nodes | The corresponding map and reduce nodes are selected and dropped onto the canvas |
| 4 | Click the Execute button to start processing | Processing starts and a URL is generated |
| 5 | Use the generated URL to view the status of the result | The URL displays the status of the result until the ML algorithm and other analytics logic have been applied and results are generated |
| 6 | Save the result | The saved output is sent to the user |
Table 9. UI Testing

Load Testing

Load testing checks how an application responds as the number of users accessing it increases; it is used to measure performance under both normal and peak load conditions. We use JMeter to perform load testing on our application. JMeter simulates a group of users sending requests to the target server and records the application's performance, for example as graphs, as the number of users is gradually increased.

Steps to perform load testing on our application:
1. Create a thread group to model a group of users.
2. Add the other JMeter components, such as the HTTP Request sampler.
3. Choose how the results should be displayed (graph, table, etc.).
4. Execute the test and compare the results through parameters such as throughput and deviation.
Performance Testing

Mocha is used to perform asynchronous testing; it is mainly used to test Node.js applications. We use Mocha to verify that the method logic executed correctly by checking the status of the REST API response (200, 404, etc.).

Example: a Mocha test that checks whether a 200 response is returned when a valid ML algorithm node is submitted after being dragged and dropped.
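The code sample for this test appears as a screenshot in the original report. The sketch below shows what such a test could look like, assuming Mocha together with the supertest library and a server running locally; the base URL, route, and payload are our assumptions.

// test/api.spec.js: a hypothetical Mocha test of the REST API status codes.
const request = require('supertest');

const BASE_URL = 'http://localhost:3000'; // assumed local server address

describe('POST /user/updateScript', function () {
    this.timeout(5000); // allow time for the HTTP round trip

    it('returns 200 when a valid script graph is posted', function (done) {
        request(BASE_URL)
            .post('/user/updateScript')
            .send({
                user_id: 'test-user',
                script: JSON.stringify({ operators: {}, links: {} })
            })
            .expect(200, done); // done reports async failures to Mocha
    });

    it('returns 404 for an unknown route', function (done) {
        request(BASE_URL)
            .post('/user/noSuchRoute')
            .expect(404, done);
    });
});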
If the correct response is obtained, we know that our method logic is being executed correctly.

Authentication Test

First, the authentication test requires that the user can reach the main page after signing in with a Facebook account or an email address. For a first-time user, the session information must be stored in the database when logging in with Facebook; for email login, the user must be able to sign up via the form. On sign-up, the email address must be checked for duplicates, and the password must be properly encrypted with the Bcrypt library. At the time of email login, the request passes HTTP Basic authentication through the HTTP header, with the credentials Base64-encoded. For existing users, the user information should be stored in a cookie as JSON after login, and the machine learning flow that the user previously drew must be loaded. The left sidebar should show information such as the user's name, and on logout the user information should disappear from the cookie.
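As a sketch of the two mechanisms these tests exercise, the snippet below shows Bcrypt hashing at sign-up and decoding of the Base64-encoded HTTP Basic credentials at login. It assumes the Node bcrypt package; the function names are illustrative, not the project's actual code.

// Hypothetical helpers matching the checks described above.
const bcrypt = require('bcrypt');

// Sign-up: never store the raw password, only the Bcrypt hash.
async function hashPassword(plain) {
    return bcrypt.hash(plain, 10); // 10 salt rounds
}

// Login: recompute and compare against the stored hash.
async function checkPassword(plain, storedHash) {
    return bcrypt.compare(plain, storedHash);
}

// HTTP Basic: the client sends "Authorization: Basic base64(email:password)".
function decodeBasicAuth(header) {
    const encoded = header.replace(/^Basic /, '');
    const decoded = Buffer.from(encoded, 'base64').toString('utf8');
    const [email, password] = decoded.split(':');
    return { email: email, password: password };
}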
Data Transaction via Router Test

Test that the function corresponding to each of the specified routing functions works. Expected inputs and outputs will be added later.

Integrated Test

Test Scenario

All test scenarios go through the following steps:
1) User login or sign-up
2) File upload
3) Drag and drop
4) Connecting the points between nodes
5) Test that data is transferred to the router normally during execution
6) Apache Spark: test that the user can check the output of the backend automatically
7) Test that the user can check the output automatically through a Web Socket when the result is displayed

Figure 17. Test Scenario

The sample dataset below serves as the test input. The UNIT, DISTRICT, REGION, POSTCODE, and ID columns are empty in this sample and are omitted from the table.

| LON | LAT | NUMBER | STREET | CITY | HASH |
|-----|-----|--------|--------|------|------|
| -118.2814555 | 37.165987 | 110 | N PIPER ST | BIG PINE | 848e5d7a57c080f8 |
| -118.2837718 | 37.1656787 | 151 | N RICHARDS ST | BIG PINE | 9e5b461661958ede |
| -118.282186 | 37.1658763 | 155 | N PIPER ST | BIG PINE | 85bc70e3800d233b |
| -118.2807843 | 37.1649342 | 214 | N PIPER ST | BIG PINE | d44de7316ab8293a |
| -118.2807845 | 37.1646678 | 242 | N PIPER ST | BIG PINE | c0478e2afefb3b57 |
| -118.2814629 | 37.165196 | 204 | N PIPER ST | BIG PINE | fa38d2e8030b1c27 |
| -118.2814706 | 37.1643965 | 264 | N PIPER ST | BIG PINE | c4924f357414fde1 |
| -118.2814681 | 37.164663 | 244 | N PIPER ST | BIG PINE | fc8a7b2c7a41baa1 |
| -118.2807859 | 37.1630689 | 392 | N PIPER ST | BIG PINE | 5cb87804d2374eb9 |
| -118.2814758 | 37.1638635 | 296 | N PIPER ST | BIG PINE | fc97a9f07cc4ce94 |
Table 10. Test Input

Test Result

Test results will be added at the end of the project.
Chapter 7. Performance and Benchmarks

Performance is measured by how responsive the application is and by how fast file upload, file download, and machine learning algorithm execution complete. Application navigation functionality, such as login and drag and drop, should complete within a second, while the execution time of a machine learning algorithm depends on the size of the uploaded dataset and the type of algorithm used; it can be anywhere between 10 seconds and 5 minutes.

The project (application server, database, Apache Spark, etc.) has been deployed on Google Compute Cloud, which decreases the downtime of the application. This cloud deployment provides an application availability of 99.96%.

Hundreds of tests were performed from the test cases given above, such as the unit test cases, and the performance benchmarks below were collected. The values are the average times observed across all relevant test cases. The table below describes the performance and benchmarks for the common functionalities.

| Type of Event | Operation Performed | Average Time Observed |
|---------------|---------------------|-----------------------|
| Application navigation | Login | < 1 sec |
| Application navigation | Sign up | < 1 sec |
| Application navigation | Navigate to the website using the URL | < 1 sec |
| Design ML model | Drag and drop the nodes representing the algorithm into the application | Instant (< 100 ms) |
| Design ML model | Drag and drop the map and reduce nodes onto the application nodes | Instant (< 100 ms) |
| Dataset | Upload file | 3 sec per 20 MB file |
| Dataset | Download output file | 3 sec per 20 MB file |
| ML algorithm execution | Map algorithm from Apache Spark | 4 min per 10 MB file |
| ML algorithm execution | Filter algorithm from Apache Spark | 4 min per 10 MB file |
Table 11. Performance and Benchmarks
Chapter 8. Deployment

Deployment plan

Figure 18. Deployment Plan

The deployment plan for our project is shown in the diagram above. First, a developer creates a new Git branch and develops locally. Since we use a Git branching strategy, there is a master branch by default, plus version-specific release-<major>.<minor>.<patchlevel> branches created on demand by developers. For example, development of the first version goes through a branch called release-0.1.0, and each developer creates this branch locally on his or her own machine.
Conflicts between developers, and between modules in each version, are prevented through periodic meetings. Development is done through an IDE (e.g., IntelliJ or Atom) and runs on the local machine through the task runner so that it can be tested locally. The developer opens a pull request against our public GitHub repository for testing and local execution, and if there are no conflicts between the local tests and the global tests, the team leader accepts the pull request and merges it into the master branch.

We use the continuous integration tool Travis CI. It continually hooks the master branch and automatically performs the scheduled tasks when a change in the hash value is detected. Each version that passes the tests is packaged with Docker, and our source code produces the following three Docker images:
1) GUI (frontend)
2) Cloud service (backend)
3) Machine learning library

The packaged Docker containers are then automatically pushed to the Google Container Registry, from which the developers themselves deploy them using Kubernetes. Deployment proceeds sequentially for each container. If an error is observed through the browser, we can revert to the previous version via Kubernetes. When an error occurs, we follow the Git branching strategy: create a hotfix-<major>.<minor>.<patchlevel> branch and go through the release process described above.
Chapter 9. Summary, Conclusions, and Recommendations

Summary

A user who wants to apply analytics or machine learning typically has to follow a strict set of rules and guidelines, and must also prepare the processing environment before using any machine learning algorithm. Highly skilled developers are needed to leverage analytic solutions or machine learning models efficiently and produce the best results. We wanted to make this process easy and simple for users, without the tedious setup. Our final project is an interactive web-based UI that allows users to drag and drop dataset files and components for processing. The user can choose among many available machine learning algorithms or include their own code for the procedure, and can upload data in a given set of file formats. Once the user has dragged and dropped all components and clicks the Process button in the UI, the back-end system, implemented in Python and Node.js, automatically runs the corresponding optimized services and provides a different URL for each process. A wide set of TensorFlow libraries is available in the application, and the user can preprocess data after uploading. The end result is provided to the user and can be downloaded to their system. Since the application is Software as a Service, users do not need to install any software; they can use the application and access the results from anywhere, and it is extremely easy to use. As all the major processes are handled by the application, it greatly saves users' time and makes them highly efficient.
Conclusions

In earlier days, data was collected through surveys, questionnaires, and similar methods, but as technology evolved, so did the ways to collect and record data, followed by the use of complex machine learning algorithms and techniques to process the resulting volume of data. Today, using machine learning algorithms efficiently and effectively can be a difficult task, requiring in-depth and thorough knowledge before they can be applied. Using our web-based visual machine learning application, the user can simply drag and drop dataset files, objects, or components, such as inputs or any popular machine learning algorithm needed for the procedure. Users can also link different machine learning algorithms to obtain a final output that combines the results of all the selected algorithm components. The proposed back-end system then runs optimized services and provides different URLs, after which the results are dynamically displayed to the user, who can download them at their convenience. This application allows users to apply complex technologies to their data without spending time mastering the technology or acquiring in-depth knowledge of it.

Recommendations for Further Research

Some recommendations that could be added to the project in the future to further improve it are as follows:
• Support more uploadable file type extensions, so the user can upload datasets in different formats.
• Allow the dataset to be modified manually after it is uploaded.
• Provide live progress updates to the user about the output when many components are included.
• Include a wider range of TensorFlow libraries in the application for the user to use.
Glossary

Analytics: The representation of data in a meaningful manner. It may include interpreting, discovering, predicting, and describing meaningful patterns hidden in the data.

Akka: An open source toolkit used to build concurrent and distributed applications via message-based, asynchronous communication.

Angular.js: An open source JavaScript-based framework for developing front-end web applications.

Apache Spark: An open source set of libraries that provides data transformation methods with up to 100 times the performance of Hadoop.

API: Application Programming Interface, a set of routines, functions, and protocols for building software.

Dashboard: A web application consisting of GUI elements that together represent actions for the user.

Docker: Docker provides abstraction of different images on a single system to achieve isolation and the ability to run multiple similar processes in isolated containers.

Drag and drop: Using the mouse to interact with elements on the screen by clicking and moving them around.
GUI: Graphical User Interface, a visual interface through which a user interacts with a system or an application.

Machine Learning: A field of computer science that gives computers the ability to learn and act on a learned model without being explicitly programmed.

MongoDB: An open source NoSQL database used to store unstructured data.

Node-RED: An open source tool that provides an interactive web application used to connect IoT devices with drag-and-drop functionality.

Node.js: An open source JavaScript runtime that runs on the server side to achieve maximum performance.

Socket.io: A JavaScript library used to develop real-time, live communication systems such as chatbots.

TensorFlow: An open source set of libraries, maintained by Google, that provides numerous machine learning algorithms.

URL: Uniform Resource Locator, a unique identifier for a resource, e.g., a web site address.

Table 12. Glossary
Appendices

Appendix A. Every Node's Dialog Screen and Its Definitions

This appendix pairs each node with a screenshot of its dialog screen; the screenshots are not reproduced here. Nodes with a dialog screen: Map, Reduce, Filter, ExtractUsingRegex, SplitUsingRegex, SplitUsingDelimiter, Duplicate, MergeWithDelimiter, FilterWithParameter, FilterUsingRegex, Slice, ConvertTypeTo, AddColumn, ChooseColumn, Flatten, ReduceBy, SortBy, takeTop, ParseUserAgent, and ParseDateTime. Nodes with no dialog screen (N/A): Distinct and Remove Header.
