Thesis: Blending Big Data and Cloud - Epilepsy Global Data Research and Information System



1. 1. Blending Big Data and Cloud - Epilepsy Global Data Research and Information System. BITS ZG629T: Thesis by Anup Singh (2012HZ12707). Thesis work carried out at Tata Consultancy Services Limited, LCH.Clearnet Limited, Investec Bank Plc London, and Birmingham Cancer Research Institute, United Kingdom. Submitted in fulfillment of M.S. by Research in Software Systems, under the supervision of Sandeep Patil (Researcher, NASA, Arlington University; ex-BARC Senior Scientist), Kalwar Shivram (Project Manager, Tata Consultancy Services Limited, San Jose, United States) and Professor B.M. Deshpande. BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI (RAJASTHAN). April 2014.
2. 2. ABSTRACT Epilepsy is the most common neurological disorder, affecting 65 million people worldwide. While medications and other treatments help many people of all ages who live with epilepsy, more than a million people continue to have seizures that can severely limit their school achievements, employment prospects and participation in all of life's experiences. It strikes most often among the very young and the very old, although anyone can develop epilepsy at any age. Its prevalence is greater than that of autism spectrum disorder, cerebral palsy, multiple sclerosis and Parkinson's disease combined. Despite how common it is, and despite major advances in diagnosis and treatment, epilepsy is among the least understood major chronic medical conditions, even though one in three adults knows someone with the disorder. The Epilepsy Global Data Research and Information System aims to leverage Big Data, Cloud Computing and data warehouse features to build a global system which will help doctors and neurosurgeons use shared information and methodologies to treat children and adults worldwide.
Objectives
• Build a federated database of medical information and services to serve as the platform for medical research into neurological cases of epilepsy.
• Provide access to very large data sets on patients with different neurological disorders, helping researchers, doctors and surgeons make efficient decisions and share their experiences.
• Deliver the best available treatment to children and adults all over the world.
• Enrich and enhance the system's knowledge base so as to stimulate new questions about epilepsy and its symptoms, ultimately leading to fruitful answers on its treatment.
• Harness supercomputer power and the capabilities of Big Data and Cloud Computing.
Broad Academic Area of Work: Cloud Computing, Big Data, Data Warehouse.
Key words: Hadoop, Twitter Apps, Spring XD, HBASE, HDFS, MapReduce, Hue, Hive, Pig, HCatalog, JSON SerDe, Flume.
3. 3. ACKNOWLEDGEMENTS I would like to express my sincere gratitude and deep regards to my supervisor and additional examiner for their constant motivation, monitoring and guidance throughout the course of this dissertation work. This is indeed a new beginning for professionals like us to extend technology beyond boundaries in healthcare. Their blessings, guidance and help gave me the impetus to begin this journey. My prime motivation behind this dissertation is my loving nephew Aakash, who has been treated for epilepsy for the past seven years, and all children over the world. My sincere regards and appreciation are extended to Dr. Vrajesh Udani and the hospital staff of Hinduja Hospital, Mumbai, and to Dr. Neeta Ajit Naik, Sion, Mumbai, who are pioneers in treating epileptic children in India. I would also like to thank my family for motivating me to build this; it would not have been possible without their constant support and help. Indeed we have a lot to go beyond this. Anup Singh
4. 4. TABLE OF CONTENTS
Chapter No. Topic (Page No.)
1. Introduction: Understanding the power of Big Data, Cloud features (1)
2. Feasibility Study and Analysis of Algorithms, Application Methodologies (2)
3. Architecture Design of the System (4)
4. Cloud Design of the Epilepsy Global Data Centre (5)
5. Data Storage Structure and Query Processing in HDFS and HBASE (6)
6. Use Cases Overview (9)
7. Conclusion and Recommendations (22)
5. 5. BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI. WORK-INTEGRATED LEARNING PROGRAMMES DIVISION. Second Semester 2013-2014.
Introduction: Understanding the power of Big Data, Cloud features
Data analysis on large volumes in fields like epilepsy, cardiac disease, genetics and neuroimaging, over groups of individuals or subjects with shared and variable characteristics, remains poorly approached as well as poorly understood. The very significant challenges of storing, accessing, building and accurately implementing complex computations on such data cannot be met with traditional data warehouse methods.
Globally as well as locally, many families from different geographies in rural and urban areas, along with modern sophisticated hospitals, are unaware of different types of health diseases, symptoms, medicines and healthcare solutions. Sharing a structured and unstructured knowledge base amongst researchers, neurologists, doctors, associates and parents is a must. A specific scientific environment and automated software applications, along with cost reduction, are needed to complement these scenarios. The gap in treating epileptic disease among children needs to be bridged by leveraging the technological revolution to predict and find new, improved ways of cure.
Matured methodologies like Kimball's approach, the Enterprise-Wide Data Warehouse (EDW), traditional RDBMS and the ETL/ELT approach are insufficient for the huge amount of epileptic data. Over the course of years we have terabytes to petabytes to zettabytes of unused data which can be transformed, utilised and re-engineered to devise new findings to cure epilepsy. We need better techniques for data access, data storage and data structures. Big Data environments create the opportunity to ease some of the rigidity of ETL-driven data integration processes, and the nature of big data requires that the infrastructure for this process scales cost-effectively. Hadoop and MongoDB have emerged as standard solutions for managing big data.
Big Data refers to the large amounts, at least terabytes, of poly-structured data that flows continuously through and around organizations, including video, text, sensor logs and transactional records. Rapidly ingesting, storing and processing big data requires a cost-effective infrastructure that can scale with the amount of data and the scope of analysis. Hadoop has rapidly emerged as the de facto standard for managing large volumes of unstructured data. Hadoop is an open source distributed software platform for storing and processing data. Written in Java, it runs on a cluster of industry-standard servers configured with direct-attached storage. Using Hadoop, you can store petabytes of data reliably on tens of thousands of servers while scaling performance cost-effectively by merely adding inexpensive nodes to the cluster.
Cloud computing has emerged as a viable alternative to the acquisition and management of physical or software resources. Scientific applications are being ported to clouds to build on their inherent elasticity and scalability, since such applications need to run in parallel on a large set of resources to achieve reasonable execution times. Cloud platforms such as Amazon Web Services, Azure and Cloudera are an interesting option to tackle this problem: they provide high-performance cloud computing infrastructure for handling the variability of epileptic "Big Data" and offer simplified, optimised deployment configurations.
We will be using Amazon Web Services (AWS) to blend the features of Big Data and Cloud Computing.
6. 6. Feasibility Study and Analysis of Algorithms, Application Methodologies
Assumptions: Representing all the features of Big Data and Cloud is out of scope and can be taken up as separate research areas in epilepsy and other healthcare problems.
We will use Amazon EMR with the Hortonworks Distribution for Hadoop, which makes it easy to provision and manage Hadoop in the AWS Cloud. Hadoop is available in multiple distributions, and Amazon EMR gives you the option of using the Amazon Distribution or the Hortonworks Distribution. Hortonworks delivers a proven, enterprise-grade platform that supports a broad set of mission-critical and real-time production uses, bringing dependability, ease of use and speed to Hadoop, NoSQL, database and streaming applications in one unified Big Data platform. Hortonworks is used across financial services, retail, media, healthcare, manufacturing, telecommunications and government organizations.
Hadoop for Big Data and Cloud
As introduced earlier, Hadoop stores and processes data on a cluster of industry-standard servers with direct-attached storage, scaling capacity and performance cost-effectively as inexpensive nodes are added. Central to the scalability of Hadoop is the distributed processing framework known as MapReduce, the programming paradigm implemented by Hadoop, which breaks a batch job up into many smaller tasks for parallel processing on a distributed system. HDFS, the distributed file system, stores the data reliably.
7. 7. MapReduce helps programmers solve data-parallel problems in which the data set can be sub-divided into small parts and processed independently. It is an important advance because it allows ordinary developers, not just those skilled in high-performance computing, to use parallel programming constructs without worrying about the complex details of intra-cluster communication, task monitoring and failure handling; MapReduce simplifies all that. The system splits the input data set into multiple chunks, each of which is assigned a map task that can process the data in parallel. Each map task reads its input as a set of (key, value) pairs and produces a transformed set of (key, value) pairs as output. The framework shuffles and sorts the outputs of the map tasks, sending the intermediate (key, value) pairs to the reduce tasks, which group them into final results. MapReduce uses JobTracker and TaskTracker mechanisms to schedule tasks, monitor them and restart any that fail. (A minimal code sketch follows the component list below.)
The Hadoop platform also includes the Hadoop Distributed File System (HDFS), which is designed for scalability and fault tolerance. HDFS stores large files by dividing them into blocks (usually 64 or 128 MB) and replicating the blocks on three or more servers. HDFS provides APIs for MapReduce applications to read and write data in parallel. Capacity and performance can be scaled by adding data nodes, and a single NameNode mechanism manages data placement and monitors server availability. HDFS clusters in production use today reliably hold petabytes of data on thousands of nodes.
In addition to MapReduce and HDFS, Hadoop includes many other components, some of which are very useful for ETL:
• Flume is a distributed system for collecting, aggregating and moving large amounts of data from multiple sources into HDFS or another central data store. Enterprises typically collect log files on application servers or other systems and archive them to comply with regulations; being able to ingest and analyse that unstructured or semi-structured data in Hadoop turns this passive resource into a valuable asset. Spring XD is a system similar to Flume.
• Sqoop is a tool for transferring data between Hadoop and relational databases. You can use Sqoop to import data from a MySQL or Oracle database into HDFS, run MapReduce on the data, and then export the data back into an RDBMS. Sqoop automates these processes, using MapReduce to import and export the data in parallel with fault tolerance.
• Hive and Pig simplify the development of applications employing the MapReduce framework. HiveQL is a dialect of SQL and supports a subset of its syntax. Although slow, Hive is being actively enhanced by the developer community to enable low-latency queries on HBase and HDFS. Pig Latin is a procedural programming language that provides high-level abstractions for MapReduce; it can be extended with User Defined Functions written in Java, Python and other languages.
• ODBC/JDBC connectors for HBase and Hive are often proprietary components included in Hadoop distributions. They provide connectivity to SQL applications by translating standard SQL queries into HiveQL commands that can be executed on the data in HDFS or HBase.
• YARN provides cluster resource management capabilities, enabling multiple data processing engines with multiple workloads and applications across a single clustered environment.
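To make the paradigm concrete, below is a minimal MapReduce sketch in the spirit of the classic word count: it tallies term frequencies across the case files under /hdfs/epilepsycases (the input path used elsewhere in this thesis). The class names, output path and the simple whitespace tokenizer are illustrative assumptions, not part of the system described here.

// Minimal MapReduce sketch: term frequencies over epilepsy case notes in HDFS.
// Class names, output path and tokenization are illustrative assumptions.
package org.myorg;

import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CaseTermCount {

  // Map phase: emit (term, 1) for every word in the case notes.
  public static class TermMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text term = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        term.set(itr.nextToken().toLowerCase());
        context.write(term, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each term.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "case term count");
    job.setJarByClass(CaseTermCount.class);
    job.setMapperClass(TermMapper.class);
    job.setCombinerClass(SumReducer.class);   // combine locally before the shuffle
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/hdfs/epilepsycases"));
    FileOutputFormat.setOutputPath(job, new Path("/hdfs/epilepsycases_counts"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}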
Thus Hadoop is a powerful platform for big data storage and processing.
8. 8. Architecture Design of the System
Hadoop receives structured and unstructured input data from different sources, including hospitals, healthcare vaccine programmes, social media and information documents, into its various platforms. The features listed in the feasibility section form the core, with HDFS data nodes that can be scaled for storage. The output is the set of application layers derived from the collated epileptic data: audio, video, documents, research publications and collaboration-forum information from social media. Data science can also be applied to find new research areas, make predictions and produce analytical reporting.
[Figure: Architecture Design of the System. Hospitals and epileptic patients' data, files of epileptic cases and scenarios, social media, and worldwide epileptic vaccine and instrument data feed through ETL into HDFS data nodes, which serve epilepsy information and knowledge sharing and advanced analytics.]
9. 9. Cloud Design of the Epilepsy Global Data Centre
[Figure: Epilepsy Global Data Centres leveraging Cloud Computing features, spanning Pakistan, UK, India, US, Malaysia and Sri Lanka.]
The cloud is core to providing Infrastructure as a Service (IaaS) to the Epilepsy Global Data Centres across the world. With volume, variety and velocity all being huge, we can scale the system up automatically based on our data needs. The overhead of maintenance, upgrades, version management and the running of Hadoop, mail and reporting services sits at the cloud provider's end. Information sharing on epilepsy across different countries becomes achievable, and we can create customised services, "Epilepsy Data as a Service", for clinical research, hospitals, doctors, neuroscientists and social media. Very large data volumes, growing toward zettabytes, can be stored. However, the cloud framework, network portability and components, and legal matters and laws across different countries will hold the key.
The cloud is also used to provide extra capacity for an existing cluster or to test Hadoop applications. Moreover, Hortonworks Data Platform (HDP) 2.0 features NameNode High Availability, which automates failover and ensures the availability of the full HDP stack. The cloud also allows use of multiple database platforms, whether MySQL, Oracle, SQL Server or other databases, and provides different reporting tools such as Jasper, SAP Business Objects, MicroStrategy and QlikView to interface with Hadoop. The cloud is certainly a multi-use platform when coupled with Big Data: Hadoop in the cloud makes a great deal of sense, because the elastic resource allocation that cloud computing is premised on works well for cluster-based data processing infrastructure used on varying analyses and data sets of indeterminate size. (A provisioning sketch follows.)
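As a hedged illustration of this elasticity, the sketch below requests a small Hadoop cluster from Amazon EMR through the AWS SDK for Java. The release label, instance types and counts, and IAM role names are assumptions for demonstration; they are not the configuration used for the Epilepsy Global Data Centre.

// Hypothetical sketch: provisioning an elastic Hadoop cluster on Amazon EMR
// via the AWS SDK for Java. All names, sizes and roles are illustrative.
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.Application;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;

public class ProvisionEpilepsyCluster {
  public static void main(String[] args) {
    AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

    RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("epilepsy-global-data-centre")
        .withReleaseLabel("emr-4.2.0")                 // EMR release label (assumed)
        .withApplications(new Application().withName("Hadoop"),
                          new Application().withName("Hive"))
        .withServiceRole("EMR_DefaultRole")            // default IAM roles (assumed)
        .withJobFlowRole("EMR_EC2_DefaultRole")
        .withInstances(new JobFlowInstancesConfig()
            .withInstanceCount(4)                      // scale up as data volumes grow
            .withMasterInstanceType("m3.xlarge")
            .withSlaveInstanceType("m3.xlarge")
            .withKeepJobFlowAliveWhenNoSteps(true));   // keep the cluster running

    RunJobFlowResult result = emr.runJobFlow(request);
    System.out.println("Started EMR cluster: " + result.getJobFlowId());
  }
}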
10. 10. Data Storage Structure and Query Processing in HDFS and HBASE
Data Storage Structure and Query Processing Flow in the Hadoop Distributed File System (HDFS) and HBASE
HDFS is a distributed file system that is well suited to the storage of large files. Data in HDFS is organized into files and directories but is not stored in a directly browsable form; we cannot access the data as in normal practice using dir or explorer commands. Files are divided into uniformly sized blocks and distributed across cluster nodes, and blocks are replicated to handle hardware failure. HDFS keeps checksums of data for corruption detection and recovery. Depending on the configuration, files are broken into blocks of 128 MB, and the block size can be configured per file. The namenode manages the file namespace, authorisation and authentication; it collects block reports from datanodes describing block locations and re-replicates missing blocks when datanodes fail. Each datanode handles storage for thousands of blocks, storing them as files of the underlying OS. Clients access the blocks directly from datanodes based on metadata read from the namenode. MapReduce uses the FileSystem interface and hence can run on multiple file systems. The HDFS storage structure is depicted below.
[Figure: Hadoop Distributed File System storage structure, showing namenode metadata and replicated blocks on datanodes.]
11. 11. Sample Java code to read files in HDFS

package org.myorg;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Cat {
  public static void main(String[] args) throws Exception {
    try {
      // Connect to HDFS using the configuration found on the classpath.
      FileSystem fs = FileSystem.get(new Configuration());
      FileStatus[] status = fs.listStatus(new Path("/hdfs/epilepsycases"));
      for (int i = 0; i < status.length; i++) {
        // Print the file status (path, block size, replication, permissions).
        System.out.println(status[i]);
        BufferedReader br = new BufferedReader(
            new InputStreamReader(fs.open(status[i].getPath())));
        String line = br.readLine();
        while (line != null) {
          System.out.println(line);
          line = br.readLine();
        }
        br.close();
      }
    } catch (Exception e) {
      System.out.println("File not found");
    }
  }
}

[root@sandbox /]# hadoop jar epilepsy_case_files.jar > epilepsy_case_files.txt

In the output we can see, for each file, the namenode metadata: block size, replication factor and permissions.
12. 12. HBase is designed as a column store, a more advanced form of key-value pair database in which the keys and values become composite. Think of it as a hash map crossed with a multidimensional array, where each column family contains rows of data. It is well suited to semi-structured data, on which MapReduce is very often used. The columns are naturally indexed and the design scales out horizontally: imagine the difference between an RDBMS table with a hundred columns and an HBase table with around 500. However, it is unsuited to complex data reads. HBase is built on top of HDFS and provides fast record lookups (and updates) for large tables; this can sometimes be a point of conceptual confusion. Internally, HBase puts your data in indexed "StoreFiles" that exist on HDFS, enabling high-speed lookups. A sample HBase storage structure, in contrast to an SQL RDBMS table, is depicted below, and a client-code sketch follows the figure.
[Figure: HBase storage structure using key-value pairs, with row key PatientID_Country and column family CF_Data holding columns such as Firstname_lastname, Doctorname_hospitalname and Evaluation_date_Observations, contrasted with an SQL RDBMS table with a primary key and columns FirstName, LastName, DoctorName, HospitalName, Surgical EvaluationDate, Evaluation/Observations, PatientID, Country.]
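A minimal client-side sketch of this key-value model, using the classic HBase Java API (HTable) of this era: it stores and reads back one patient row keyed by PatientID_Country in the CF_Data column family from the figure. The table name and all values are illustrative assumptions.

// Sketch of the HBase storage model above: one row per patient, keyed by
// PatientID_Country, with sparse columns in the CF_Data family.
// Table name and values are illustrative assumptions.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class PatientStore {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "epilepsy_patients");   // assumed table name

    byte[] cf = Bytes.toBytes("CF_Data");
    byte[] rowKey = Bytes.toBytes("P1001_India");           // PatientID_Country key

    // Write one patient record as sparse columns under CF_Data.
    Put put = new Put(rowKey);
    put.add(cf, Bytes.toBytes("Firstname_lastname"), Bytes.toBytes("Aakash Singh"));
    put.add(cf, Bytes.toBytes("Doctorname_hospitalname"), Bytes.toBytes("Dr. Udani, Hinduja Hospital"));
    put.add(cf, Bytes.toBytes("Evaluation_date_Observations"), Bytes.toBytes("2014-03-19: presurgical EEG reviewed"));
    table.put(put);

    // Fast record lookup by row key.
    Result result = table.get(new Get(rowKey));
    byte[] observations = result.getValue(cf, Bytes.toBytes("Evaluation_date_Observations"));
    System.out.println(Bytes.toString(observations));

    table.close();
  }
}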
13. 13. Use Cases
Pool social media data and analyse the information on epilepsy. This is aimed at self-supported care globally: in today's fast-changing world there is a huge population on Twitter, Facebook and LinkedIn, with a common synergy and a huge exchange of shared information.
[Figure: Streaming and analysing social media data flow in Hadoop. The Epilepsy Social Media Twitter app streams into the Spring XD engine, which ingests the data into Hadoop HDFS, where the unstructured JSON is parsed and analysed.]
14. 14. Scenario
This scenario focuses on streaming unstructured data in real time from the Twitter app "Epilepsy Social Media" and transforming it into useful information.
Step 1: Create a collaboration forum app "Epilepsy Social Media" on Twitter. Note down the API key, API secret, access token and access secret; these keys are needed to stream information from Twitter. Once we have the keys, we configure the Spring XD engine installed on the Hadoop server.
15. 15. Step 2: Log in to the Spring XD engine in a separate shell from Hadoop. Test whether HDFS is accessible:

hadoop fs -ls /

It should display some files and directories.

Step 3: Create the tweet stream on the collaboration forum in Spring XD:

stream create --name epilepsytweets --definition "twitterstream --track='epilepsysociety, epilepsy society' | hdfs"
16. 16. Step 4: Check whether the stream is writing files into XD's HDFS directory:

hadoop fs -ls /xd/epilepsytweets
17. 17. The tweets that were posted are listed in the files shown in the screenshot.
18. 18. JSON Data Format
{"created_at":"Wed Mar 19 19:33:25 +0000 2014","id":446368866097065984,"id_str":"446368866097065984",
"text":"@epilepsysociety Hi we should build some ideas and come together to create awareness on epilepsy many countries mothers and fathers dont knw",
"source":"web","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":87454049,"in_reply_to_user_id_str":"87454049","in_reply_to_screen_name":"epilepsysociety",
"user":{"id":2387686938,"id_str":"2387686938","name":"AnupSingh","screen_name":"anupsingh4u","location":"","url":null,"description":null,"protected":false,"followers_count":4,"friends_count":8,"listed_count":0,"created_at":"Thu Mar 13 19:00:48 +0000 2014","favourites_count":0,"utc_offset":null,"time_zone":null,"geo_enabled":false,"verified":false,"statuses_count":8,"lang":"en-gb","contributors_enabled":false,"is_translator":false,"is_translation_enabled":false,"profile_background_color":"C0DEED","profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","profile_background_tile":false,"profile_image_url":"http://abs.twimg.com/sticky/default_profile_images/default_profile_0_normal.png","profile_image_url_https":"https://abs.twimg.com/sticky/default_profile_images/default_profile_0_normal.png","profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"default_profile":true,"default_profile_image":true,"following":null,"follow_request_sent":null,"notifications":null},
"geo":null,"coordinates":null,"place":null,"contributors":null,"retweet_count":0,"favorite_count":0,
"entities":{"hashtags":[],"symbols":[],"urls":[],"user_mentions":[{"screen_name":"epilepsysociety","name":"epilepsy society","id":87454049,"id_str":"87454049","indices":[0,16]}]},
"favorited":false,"retweeted":false,"filter_level":"medium","lang":"en"}
  19. 19. {"created_at":"Wed Mar 19 20:07:31 +0000 2014","id":446377448163143680,"id_str":"446377448163143680","text":"I'm fundraising for Epilepsy Society &amp; I'd love your support! Text HERB49 u00a32 to 70070 to sponsor me today. Thanks.","source":"u003ca href="" rel="nofollow"u003eTweet Buttonu003c/au003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_stat us_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_sc reen_name":null,"user":{"id":98352324,"id_str":"98352324","name":"Steven Herbert","screen_name":"sherbie40","location":"chepstow","url":null,"description":"Play the guitar til your fingers bleed, quoted by Ted Nugent..nnLifes to short get on with it...","protected":false,"followers_count":43,"friends_count":107,"listed_count":1,"create d_at":"Mon Dec 21 11:14:17 +0000 2009","favourites_count":1,"utc_offset":0,"time_zone":"London","geo_enabled":true,"ve rified":false,"statuses_count":119,"lang":"en","contributors_enabled":false,"is_translator ":false,"is_translation_enabled":false,"profile_background_color":"C0DEED","profile_bac kground_image_url":"","profi le_background_image_url_https":" bg.png","profile_background_tile":false,"profile_image_url":" ofile_images/442675380076703744/Oje9Ifzk_normal.jpeg","profile_image_url_https": " g","profile_banner_url":" 7010","profile_link_color":"0084B4","profile_sidebar_border_color":"C0DEED","profile_si debar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_imag e":true,"default_profile":true,"default_profile_image":false,"following":null,"follow_reque st_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors" :null,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"symbols":[],"urls": [{"url":"","expanded_url":" ven-Herbert","display_url":" Herbert","indices":[119,141]}],"user_mentions":[]},"favorited":false,"retweeted":false," possibly_sensitive":false,"filter_level":"medium","lang":"en"} 15
20. 20. Step 5: Stop or undeploy the stream after collecting some data.

stream undeploy --name epilepsytweets

Step 6: Refine the data using Hive. Create tables over the streamed data in Hive (a sketch of the table definition follows).
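A hedged sketch of this refinement step: it issues the table DDL through the HiveServer2 JDBC driver. The host name, credentials, column subset and the choice of the HCatalog JsonSerDe are assumptions for illustration; the same DDL can equally be run from the Hive shell or Hue.

// Hedged sketch of Step 6: defining a Hive table over the streamed tweet
// JSON via the HiveServer2 JDBC driver. Host, database, credentials and the
// SerDe class are assumptions.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RefineTweets {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://sandbox:10000/default", "hue", "");
         Statement stmt = conn.createStatement()) {

      // External table over the raw tweets streamed into /xd/epilepsytweets;
      // the JSON SerDe maps each tweet attribute to a matching column.
      stmt.execute(
          "CREATE EXTERNAL TABLE IF NOT EXISTS epilepsy_tweets ("
        + "  created_at STRING,"
        + "  text STRING,"
        + "  `user` STRUCT<screen_name:STRING, followers_count:INT>,"
        + "  lang STRING)"
        + " ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'"
        + " LOCATION '/xd/epilepsytweets'");
    }
  }
}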
21. 21. We can see in the Hadoop interface that the tweets have been brought into a structured format. A report can be built on top of it.
22. 22. Use Cases
Collect and represent information on epilepsy types, symptoms and medicines, with the pros and cons of each. Collect and represent information on neurosurgeons, successful cases handled and publications.
Scenario: Collect doctors' data from different hospitals and research centres.
The Hive ETL script below loads the list of doctors into the warehouse: the raw CSV rows are first loaded into a staging table (temp_doctor) as single strings in column col_value, then parsed field by field with regexp_extract.

create table tbl_doctor (
  id string,
  name string,
  age int,
  hospitalname string,
  expertise string,
  publications_link string,
  profile_info string,
  country string,
  city string);

LOAD DATA INPATH '/user/hue/Doctors_List.csv' OVERWRITE INTO TABLE temp_doctor;

insert overwrite table tbl_doctor
SELECT
  regexp_extract(col_value, '^(?:([^,]*),?){1}', 1) doctor_id,
  regexp_extract(col_value, '^(?:([^,]*),?){2}', 1) fullname,
  regexp_extract(col_value, '^(?:([^,]*),?){10}', 1) age,
  regexp_extract(col_value, '^(?:([^,]*),?){3}', 1) organisation,
  regexp_extract(col_value, '^(?:([^,]*),?){11}', 1) specialisation,
  regexp_extract(col_value, '^(?:([^,]*),?){8}', 1) articles_cited,
  regexp_extract(col_value, '^(?:([^,]*),?){13}', 1) wiki_profile,
  regexp_extract(col_value, '^(?:([^,]*),?){4}', 1) Country,
  regexp_extract(col_value, '^(?:([^,]*),?){5}', 1) City
from temp_doctor;

We can customise the script based on the information received from hospitals and research centres; column positions can be toggled. For example, if the specialisation field in the Hinduja hospital doctors list is at position 11, we use the script above; if the specialisation field in the Fortis hospital list is at position 14, we modify the corresponding line to "regexp_extract(col_value, '^(?:([^,]*),?){14}', 1) specialisation".
23. 23. Scenario: Build a catalog of epilepsy types and epilepsy medicines. HCatalog provides an easy interface to upload files in different formats and set up the data.
24. 24. Scenario: Collect patients' data related to presurgical evaluation, medical history, physical examination and lab tests. The other tables are represented below. We can have customised ETL jobs based on each hospital's data and can automate the process once we have the list of files. However, it will be essential to encrypt or mask the data rather than reveal individual names; this will be subject to the healthcare laws of different nations. This scenario can be complemented by writing Pig scripts to compare data on epileptic patients across different states or countries (a sketch follows).
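As a hedged sketch of that Pig comparison, the program below drives Pig from Java through PigServer, grouping patient records by country and counting them. The input path and the assumed CSV field layout are illustrative and would be adapted to the actual hospital files.

// Hedged sketch: comparing epileptic patient counts across countries with
// Pig, driven from Java via PigServer. Path and field layout are assumptions.
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class CompareByCountry {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // Assumed CSV layout: patient_id, country, seizure_type, onset_age
    pig.registerQuery("patients = LOAD '/user/hue/patients' USING PigStorage(',') "
        + "AS (patient_id:chararray, country:chararray, seizure_type:chararray, onset_age:int);");
    pig.registerQuery("by_country = GROUP patients BY country;");
    pig.registerQuery("counts = FOREACH by_country GENERATE group AS country, COUNT(patients) AS cases;");

    // Write the per-country counts back to HDFS for reporting.
    pig.store("counts", "/user/hue/patient_counts_by_country");
  }
}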
25. 25. Scenario: Information about events can be shared easily over email to increase awareness. Design the job in the Oozie Editor/Dashboard (a submission sketch follows).
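A hedged companion sketch: once the workflow is designed in the Oozie Editor, it could also be submitted programmatically with the Oozie Java client. The Oozie URL, HDFS application path and property values below are assumptions for illustration.

// Hedged sketch: submitting the awareness-mail workflow from Java with the
// Oozie client. URL, application path and property values are assumptions.
import java.util.Properties;
import org.apache.oozie.client.OozieClient;

public class SubmitAwarenessMailJob {
  public static void main(String[] args) throws Exception {
    OozieClient oozie = new OozieClient("http://sandbox:11000/oozie");

    Properties conf = oozie.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://sandbox:8020/user/hue/oozie/epilepsy_mail_wf");
    conf.setProperty("user.name", "hue");
    conf.setProperty("nameNode", "hdfs://sandbox:8020");
    conf.setProperty("jobTracker", "sandbox:8050");

    String jobId = oozie.run(conf);   // submit and start the workflow
    System.out.println("Workflow started: " + jobId);
  }
}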
26. 26. Conclusion and Recommendations
The aim of this blended case is to increase networking amongst hospitals, doctors, people and children, thereby improving healthcare systems. We can have a proper Kimball-model data warehouse as well as a federated data warehouse in Hadoop, since Big Data is feasible for structured as well as unstructured data. Data from different testing methods and research is already available, so we can carry out data mining and make predictions on epileptic data. This will also aid in recognising the difference between normal and abnormal patterns in epileptic sufferers. Cognitive features based on neural networks can be aimed at reading the machine output of tests carried out on epilepsy patients, so that test data and its scenarios can be known upfront based on the parameters, and algorithms can be developed to make the system precise and platform-agnostic. We can also aim to build a language-interpreter app which presents the epilepsy data in different languages to target audiences across different countries, helping to bridge the language barrier between the languages spoken over the world. Document stores for CT scans, MRI and EEG recordings can be explored in MongoDB to optimise audio and video data. Interfacing with SAP HANA, SAP Business Objects, MicroStrategy, Jasper, QlikView and other reporting tools can be carried out so that we have graphs and data representing normal behaviour and deviated behaviour in seizures.
27. 27. List of Abbreviations
AWS - Amazon Web Services
EMR - Elastic MapReduce
HDP - Hortonworks Data Platform
EDW - Enterprise-wide Data Warehouse
HDFS - Hadoop Distributed File System
IAAS - Infrastructure as a Service

List of Figures
Page 1: Hadoop Architecture
Page 2: Architecture Design of the System
Page 3: Epilepsy Global Data Centres Leveraging Cloud Computing Features
Page 4: Data Storage Structure and Query Processing Flow in HDFS and HBASE
Page 4: HDFS Storage Structure
Page 8: HBASE Storage Structure
28. 28. Literature References
[1] http://project-epilepsy.wikispaces.asu.edu/
[2] Moving To The Cloud: Developing Apps in the New World of Cloud Computing. Dinkar Sitaram, Geetha Manjunath.
[3] http://hadoop.apache.org/docs/
[4] http://docs.hortonworks.com/
[5] https://github.com/spring-projects/spring-xd/wiki/Sources
[6] http://www.drdobbs.com/database/hadoop-writing-and-running-your-first-pr/240153197
[7] Artificial Intelligence and Soft Computing: Behavioral and Cognitive Modeling of the Human Brain, Volume 1. Amit Konar.
[8] Computational Intelligence: Principles, Techniques and Applications. Amit Konar.
[9] www.epilepsy.com
[10] http://www.epilepsysociety.org.uk/
[11] http://hortonworks.com/hdp/downloads/
[12] http://aws.amazon.com/elasticmapreduce/
[13] http://www.ninds.nih.gov/disorders/epilepsy/epilepsy.htm
[14] Dr. Vrajesh Udani. details.aspx?did=140&name=dr-vrajesh-udani&cid=36&cname=
[15] http://www.hindujahospital.com/
[16] http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
[17] Jayapandian CP, Chen CH, Bozorgi A, Lhatoo SD, Zhang GQ, Sahoo SS. Electrophysiological Signal Analysis and Visualization using Cloudwave for Epilepsy Clinical Research. The 14th World Congress on Medical and Health Informatics (MedInfo), 2013.
[18] Hadoop Architecture. analytics-turning-big-data-into-intelligence.html