Computational intelligence for big data analytics bda 2013


Big Data Analytics - a presentation by Dr Dom Heger at the 2013 International Conference on Knowledge, Innovation and Enterprise, London UK


  1. Big Data Analytics: Challenges and What Computational Intelligence Techniques May Offer. Ah-Hwee Tan (http://www.ntu.edu.sg/home/asahtan), School of Computer Engineering, Nanyang Technological University. Big Data Analytics Symposium, London, UK, 13 September 2013
  2. Outline: Big Data Analytics; Computational Intelligence Techniques; Web Data Analytics – Flexible Organizer for Competitive Intelligence (FOCI), Web Information Fusion and Associative Discovery; Analytics for Active Living for the Elderly
  3. The Era of Big Data. Big data refers to collections of data sets so large and complex that they exceed the capacity of commonly used IT systems in terms of processing space and/or time.
  4. Sources of Big Data. Traditionally, big data was mostly produced in scientific fields such as astronomy, meteorology, genomics, physics, biology, and environmental research. With the rapid development of IT and the consequent decrease in the cost of collecting and storing data, big data is now generated in almost every industry, sector, and governmental department, including retail, finance, banking, security, audit, electric power, and healthcare. More recently, big data over the Web (big Web data for short) includes all the context data, such as user-generated content, browser/search log data, deep-web content, etc.
  5. Examples of Big Data (Source: Wikipedia). Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data – the equivalent of 167 times the information contained in all the books in the US Library of Congress. Facebook handles 50 billion photos from its user base. The FICO Falcon Credit Card Fraud Detection System protects 2.1 billion active accounts world-wide. Windermere Real Estate uses anonymous GPS signals from nearly 100 million drivers to help new home buyers determine their typical drive times to and from work throughout various times of the day.
  6. Examples of Big Data (Source: Wikipedia). The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of climate observations and simulations on the Discover supercomputing cluster. The Utah Data Center is a data center currently being constructed by the United States National Security Agency. When finished, the facility will handle yottabytes of information collected by NSA over the Internet. (Byte-unit scale: 1000 = kB kilobyte, 1000^2 = MB megabyte, 1000^3 = GB gigabyte, 1000^4 = TB terabyte, 1000^5 = PB petabyte, 1000^6 = EB exabyte, 1000^7 = ZB zettabyte, 1000^8 = YB yottabyte.)
  7. Money of Big Data (Source: Wikipedia). "Big data" has increased the demand for information management specialists. Software AG, Oracle Corporation, IBM, Microsoft, SAP, EMC, and HP have spent more than $15 billion on software firms specializing in data management and analytics. In 2010, this industry on its own was worth more than $100 billion and was growing at almost 10 percent a year: about twice as fast as the software business as a whole.
  8. Market of Big Data (Source: Wikipedia). Developed economies make increasing use of data-intensive technologies. There are 4.6 billion mobile-phone subscriptions worldwide and there are between 1 billion and 2 billion people accessing the internet. The world's effective capacity to exchange information through telecommunication networks was 281 petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes in 2000, and 65 exabytes in 2007[14], and it is predicted that the amount of traffic flowing over the internet will reach 667 exabytes annually by 2013.[5]
  9. Big Data Market Segments (Report by Transparency Market Research). Segmentation of the big data market by components, by applications, and by geography. The components included are software and services, hardware, and storage. The software and services segment dominates the components market, whereas the storage segment will be the fastest growing segment for the next 5 years owing to the perpetual growth in the data generated.
  10. Big Data Market Segment by Applications. The report covers eight applications in the application segment, namely financial services, manufacturing, healthcare, telecommunication, government, retail, media & entertainment, and others. Financial services, healthcare, and the government sector are the top three contributors to the big data market and together held more than 55% of the big data market in 2012. The media and entertainment and healthcare sectors will grow at a high CAGR of nearly 42% from 2012 to 2018. The growth in data in the form of video, images, and games is driving the media and entertainment segment. Read more: http://www.digitaljournal.com/pr/1395146#ixzz2b0hvuxrQ
  11. Challenges of Big Data. Volume – size in the order of petabytes, exabytes, and beyond (byte-unit scale as on slide 6). Velocity – time-sensitive data, data that grow exponentially or even at rates that overwhelm the well-known Moore's Law. Variety – from structured data to semi-structured and completely unstructured data of different types, such as text, image, audio, video, click streams, and log files.
  12. Deeper Issues of Big Data (the additional 3Vs). Validity – is the data correct and accurate for the intended usage? Veracity – are the results meaningful for the given problem space? Volatility – how long do you need to look at/store this data?
  13. Computational Intelligence. Neural Networks (IJCNN) – brain-like mathematical models for pattern recognition, memory, and association discovery; examples: Perceptron, BP, SVM, SOM, ART, … Fuzzy Systems (IEEE-FUZZ) – fuzzy operators for handling non-discrete reasoning; examples: FNN, Fuzzy C-Means, … (A minimal perceptron sketch follows below.)
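As a concrete illustration of the neural-network family named on the slide, here is a minimal perceptron sketch in Python/NumPy. It is for intuition only; the learning rate, epoch count, and toy data are illustrative assumptions, not material from the talk.

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Minimal perceptron: learn weights w and bias b so that sign(X @ w + b) matches y in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # Classic perceptron rule: update only on misclassified samples.
            if yi * (xi @ w + b) <= 0:
                w += lr * yi * xi
                b += lr * yi
    return w, b

# Toy usage with two linearly separable classes.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
print(np.sign(X @ w + b))  # expected: [ 1.  1. -1. -1.]
```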
  14. Computational Intelligence. Evolutionary Computing (CEC) – classes of heuristic algorithms that repeatedly search for good solutions by mimicking the process of natural evolution; commonly used for optimization and search problems; examples: Genetic Algorithms, Memetic Algorithms, … (A minimal genetic-algorithm sketch follows below.)
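To make the evolutionary-computing bullet concrete, below is a minimal genetic-algorithm sketch (tournament selection, one-point crossover, bit-flip mutation) maximizing the classic OneMax fitness. Population size, rates, and the fitness function are illustrative assumptions only.

```python
import random

def one_max(bits):
    """Toy fitness: count of 1-bits (the classic OneMax benchmark)."""
    return sum(bits)

def genetic_algorithm(n_bits=20, pop_size=30, generations=50, p_cross=0.9, p_mut=0.02):
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=one_max)

    def select():
        # Binary tournament selection over the current population.
        a, b = random.sample(pop, 2)
        return a if one_max(a) >= one_max(b) else b

    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            p1, p2 = select(), select()
            c1, c2 = p1[:], p2[:]
            if random.random() < p_cross:
                # One-point crossover.
                point = random.randint(1, n_bits - 1)
                c1 = p1[:point] + p2[point:]
                c2 = p2[:point] + p1[point:]
            for child in (c1, c2):
                # Bit-flip mutation.
                for i in range(n_bits):
                    if random.random() < p_mut:
                        child[i] = 1 - child[i]
            children.extend([c1, c2])
        pop = children[:pop_size]
        best = max(pop + [best], key=one_max)
    return best

print(one_max(genetic_algorithm()))  # typically reaches 20 (all ones)
```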
  15. Flagship Events of Computational Intelligence. World Congress on Computational Intelligence (Australia 2012, Beijing 2014); IEEE Symposium on Computational Intelligence (Singapore 2013, Florida, USA 2014); IEEE Symposium on Computational Intelligence in Big Data (IEEE CIBD'2014)
  16. Examples of the Use of CI in Big Data: data size and feature space adaptation; uncertainty modeling in learning from big data; distributed learning techniques in uncertain environments; uncertainty in cloud computing; distributed parallel computation; feature selection/extraction in big data; sample selection based on uncertainty; incremental learning; manifold learning on big data; uncertainty techniques in big data classification/clustering; imbalance learning on big data; active learning on big data; random weight networks on big data; transfer learning on big data
  17. Self-Organizing Neural Networks for Personalized Web Intelligence. Towards Personalized Web Intelligence. Ah-Hwee Tan, Hwee-Leng Ong, Hong Pan, Jamie Ng, Qiu-Xiang Li. Knowledge and Information Systems 18 (2004) 297-306
  18. Workflow for Web Data Analytics. Search – getting the information. Organize (clustering/categorizing) – putting things in perspective. Analyze (data mining) – discover hidden knowledge. Share (knowledge management) – saving for reference and sharing. Track – constant monitoring. (A skeleton of this pipeline is sketched below.)
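Reading the five stages as a pipeline, the skeleton below makes the data flow concrete. The stage functions are hypothetical placeholders passed in by the caller; none of the names come from the talk.

```python
def web_analytics_pipeline(query, search, organize, analyze, share, track):
    """Hypothetical skeleton of the Search -> Organize -> Analyze -> Share -> Track workflow.
    Each stage is supplied as a function so the sketch stays implementation-agnostic."""
    documents = search(query)        # Search: getting the information
    clusters = organize(documents)   # Organize: putting things in perspective
    insights = analyze(clusters)     # Analyze: discover hidden knowledge
    share(insights)                  # Share: saving for reference and sharing
    track(query, clusters)           # Track: constant monitoring for new information
    return insights
```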
  19. Approaches to Organizing/Analyzing. Clustering – organizing information into groups based on similarity functions and thresholds; e.g. BullsEye, NorthernLight, Vivisimo. Categorization – organizing information into a "predefined" set of classes; e.g. Yahoo!, Autonomy Knowledge Server. Which is better?
  20. Clustering. Pros – unsupervised/self-organizing, requires no training or predefinition of classes; able to identify new themes. Cons – users have no control; ever-changing cluster structure; difficult to navigate and track.
  21. Categorization. Pros – good control over classes; every piece of information is assigned to one or more classes of interest. Cons – requires learning (supervised) and/or definition of classification rules/knowledge; every piece of information has to be assigned to one or more classes; good control but lacks the flexibility to handle new information.
  22. User-configurable Clustering (Tan & Pan, PAKDD 2002). Information organization and content management. Online incremental clustering + user-defined structure (preferences). Reduces to a clustering system if no user indication is given. Allows personalization in a direct, intuitive, and interactive manner. Control + flexibility.
  23. ARAM for Personalized Information Management. [Architecture diagram: a category field F2 of information clusters sits above two input fields, F1a receiving the information vector (channel A) and F1b receiving the preference vector (channel B).] (A fuzzy-ART-style sketch of these operations follows below.)
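The diagram pairs an information channel (A) and a preference channel (B) beneath a single cluster field. As a rough illustration of how such a two-channel ART network operates, the sketch below applies the standard fuzzy-ART operations (complement coding, choice, vigilance match, learning) across both channels. It is a minimal sketch under assumed parameter values; it is not the ARAM/FOCI implementation from the talk.

```python
import numpy as np

def complement_code(x):
    """Complement coding: concatenate x with (1 - x), keeping feature norms bounded."""
    return np.concatenate([x, 1.0 - x])

class TwoChannelART:
    """Minimal two-channel, fuzzy-ART-style clusterer (channel A: information vector,
    channel B: preference vector). Parameter names and defaults are illustrative."""

    def __init__(self, gamma=(0.5, 0.5), alpha=0.01, beta=1.0, rho=(0.2, 0.5)):
        self.gamma = gamma    # contribution weight per channel
        self.alpha = alpha    # choice parameter
        self.beta = beta      # learning rate (1.0 = fast learning)
        self.rho = rho        # vigilance per channel
        self.clusters = []    # each cluster: list of per-channel weight templates

    def _choice(self, inputs, w):
        # Weighted sum of fuzzy-ART choice values across channels.
        return sum(g * np.minimum(i, wk).sum() / (self.alpha + wk.sum())
                   for g, i, wk in zip(self.gamma, inputs, w))

    def _match(self, inputs, w):
        # Resonance requires the vigilance criterion to hold in every channel.
        return all(np.minimum(i, wk).sum() / i.sum() >= r
                   for i, wk, r in zip(inputs, w, self.rho))

    def present(self, channel_a, channel_b):
        """Present one pattern (one vector per channel, values in [0, 1]); return its cluster index."""
        inputs = [complement_code(np.asarray(c, dtype=float)) for c in (channel_a, channel_b)]
        ranked = sorted(range(len(self.clusters)),
                        key=lambda j: self._choice(inputs, self.clusters[j]), reverse=True)
        for j in ranked:
            if self._match(inputs, self.clusters[j]):
                self.clusters[j] = [(1 - self.beta) * wk + self.beta * np.minimum(i, wk)
                                    for i, wk in zip(inputs, self.clusters[j])]
                return j
        self.clusters.append(inputs)   # no resonance: commit a new cluster
        return len(self.clusters) - 1
```

For example, `net = TwoChannelART(); net.present([0.9, 0.1, 0.0], [1.0, 0.0])` either resonates with an existing cluster or commits a new one, which is the incremental behaviour the FOCI personalization loop on the later slides builds on.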
  24. Flexible Organizer for Competitive Intelligence (FOCI). A platform for gathering, organizing, tracking, analyzing, and sharing competitive information. A natural way of turning raw search results into personalized CI portfolios – multilingual enabled, with the Multilingual Efficient Analyzer – domain localization (Technology). Patented and licensed to many companies.
  25. FOCI User Interface
  26. FOCI Architecture. [Architecture diagram: content gathering from the Intranet/Internet, content analysis with domain-specific knowledge, content management of the user's CI portfolio, content publishing, and a visualization front end.]
  27. Personalized Content Management. Portfolio created through search. Unsupervised clustering (ARAM pattern channel A). Loop: personalization by users (ARAM pattern channel B); reorganization of clusters (ARAM pattern channels A & B). Saving of personalized portfolio. Tracking of new information.
  28. Personalization Functions. Marking/labeling (selected) clusters – personal interpretation. Inserting clusters – indicate preference on groupings. Merging clusters – indicate preferences on similarities. Splitting clusters – indicate preferences on differences. ...
  29. Information Clustering. A portfolio created by a meta-search of 4 search engines with a query on "Text Mining".
  30. A Personalized Portfolio after <=19 personalization operations (mainly labeling and creating clusters)
  31. Organizing New Information. 42 new documents from DirectHit, Netscape, and BusinessWire, organized without the personalized portfolio vs. based on the personalized portfolio.
  32. Summary. A fusion neural network algorithm, called fusion ART, has been proposed for integrating clustering and categorization, and has been applied to competitive intelligence on the web. Compared with existing works, fusion ART has the following advantages: Personalization – fusion ART performs analysis and organization of data based on user preferences. Low time complexity – fusion ART performs real-time search and match of patterns, resulting in a linear time complexity. Incremental clustering – fusion ART may adapt to a dynamic web multimedia data set by incrementally clustering new patterns based on the learnt cluster structure, without referring to the old data.
  33. Heterogeneous Data Co-clustering for Social Media Data Theme Discovery and Mining. Lei Meng, Ah-Hwee Tan and Dong Xu. IEEE Transactions on Knowledge and Data Engineering, 2013
  34. Introduction. The popularity of social websites leads to a great increase of web multimedia documents: massive number – billions of images and articles online; diversity – diverse content and booming emerging topics; multi-modal descriptors – images, text, category, tags, keywords, comments. [Example: an image in category "Birds" with surrounding text and tags: wild, bird, beach, tree, vacation, animal, mar, sunny, playa, nayarit, arena, ave, water, vacaciones, hollyday, pelicano.]
  35. Introduction. Clustering of web multimedia data is challenging: scalability to big data; difficulty in integrating multi-modal feature data; ambiguity in deciding the number of categories; rich but noisy meta-information – the semantic gap of images, noisy tags. [Examples: "Birds" – wild, bird, beach, tree, vacation, animal, mar, sunny, playa, nayarit, arena, ave, water, vacaciones, hollyday, pelicano. "Beach" – ocean, blue, sea, summer, vacation, sun, man, beach, water, yellow, fun, sand, play, funny, adult, humor, lifestyle, sunny, resort.]
  36. Problem Statement. We define the theme discovery of web multimedia data as a heterogeneous data co-clustering problem, which identifies the semantic categories of data patterns through the fusion and recognition of multiple types of features. [Example: "Apple" described through multiple channels – category (Fruits, Products, Movies), tags, user description, and surrounding text.]
  37. Proposed Approach. A self-organizing neural network approach to heterogeneous data co-clustering: based on Fusion Adaptive Resonance Theory (Fusion ART); fuses an arbitrary number of feature modalities; adaptively tunes the weights for different feature modalities; uses two different learning functions, one for primary data such as images and articles and one for meta-information, to handle short and noisy text; incremental fast learning; does not need to be given the number of clusters. (A hedged sketch of the channel-weighting idea follows below.)
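One way to picture the "adaptively tune the weights" bullet: channels whose cluster templates fit their member patterns tightly are treated as more reliable and are given larger contribution weights. The sketch below conveys that idea only; it is not the exact robustness-based update rule from the GHF-ART paper, and the numbers are illustrative.

```python
import numpy as np

def adaptive_channel_weights(avg_intra_cluster_diff):
    """Hedged sketch of adaptive channel weighting.

    avg_intra_cluster_diff: dict mapping channel name -> average difference between
    cluster templates and their member patterns in that channel (smaller = tighter fit).
    Returns contribution weights summing to 1, favouring the tighter-fitting channels."""
    reliability = {k: np.exp(-d) for k, d in avg_intra_cluster_diff.items()}
    total = sum(reliability.values())
    return {k: r / total for k, r in reliability.items()}

# Illustrative numbers only: if the textual channel fits its clusters more tightly than
# the visual channel, it receives the larger weight (consistent with the NUS-WIDE finding
# on the next slides that surrounding text ends up weighted higher than visual features).
print(adaptive_channel_weights({"visual": 0.6, "text": 0.2}))  # text weight ~0.6, visual ~0.4
```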
  38. Experiments. NUS-WIDE data set – 36784 images of 18 categories; visual features: grid color moment, edge direction histogram, and wavelet texture; textual features of surrounding text: 1142 words (7 words per image on average). 20 Newsgroups data set – 12826 text documents of 10 categories; textual features of document content: over 60k words (800 words per document on average); textual features of category: 3 labels per document on average.
  39. Experiments on NUS-WIDE Data Set. Evaluation of weight adaptation across channels for visual and textual features – performance comparison with fixed weight values. GHF-ART with the adaptively tuned weight values γ_SA achieves the best performance in 5 classes and in the overall performance, and achieves performance close to the best results obtained with fixed weight values.
  40. Experiments on NUS-WIDE Data Set. Tracking of the change in weight values of γ_SA: textual features of surrounding text are assigned higher weights than visual features; the value of γ_SA stabilizes in [0.7, 0.8] as more patterns are presented; big fluctuations may result from the generation of new clusters.
  41. Experiments on NUS-WIDE Data Set. Clustering performance comparison with existing algorithms in terms of weighted average precision, cluster entropy (H_cluster), class entropy (H_class), purity, and rand index (RI). GHF-ART achieves the best performance in terms of all the evaluation measures. With supervisory information, GHF-ART(SS) consistently obtains better performance.
  42. Experiments on NUS-WIDE Data Set. Time complexity analysis – GHF-ART and Fusion ART incur only a very small increase in time cost; for 23284 images, GHF-ART completes the clustering process in 10 seconds.
  43. Experiments on 20 Newsgroups Data Set. Clustering performance comparison using document content and category information: both GHF-ART and GHF-ART(SS) outperform other algorithms in all the evaluation measures; GHF-ART has a 5% gain over Fusion ART in terms of average precision, purity, and rand index; compared with other unsupervised algorithms, GHF-ART achieves around 80% in average precision, purity, and rand index, while other algorithms typically obtain less than 75%.
  44. Summary. A heterogeneous data co-clustering algorithm, called GHF-ART, is proposed to discover the themes of web multimedia data via their rich but heterogeneous descriptors. Compared with existing works, GHF-ART has the following advantages: strong noise immunity – a learning function for meta-information is proposed to handle noise; adaptive channel weighting – a well-defined weighting algorithm is proposed to identify the important feature modalities for a better fusion of multi-modal features in the overall similarity measure; low time complexity – GHF-ART performs real-time search and match of patterns, resulting in a linear time complexity for big data; incremental clustering – GHF-ART may adapt to a dynamic web multimedia data set by incrementally clustering new patterns based on the learnt cluster structure, without referring to the old data.
  45. Research Centre of Excellence in Active LIving for the elderLY (LILY). Aging in Place: Opportunities and Challenges. Ah-Hwee Tan (http://www.ntu.edu.sg/home/asahtan), School of Computer Engineering, Nanyang Technological University. JOINT UBC-NTU RESEARCH CENTRE
  46. Aging in Place. "The ability to live in one's own home and community safely, independently, and comfortably, regardless of age, income, or ability level" – Center for Disease Control, Dec 2011
  47. Motivation. The global aging population creates silver challenges. Most adults would prefer to age in place: 78 percent of adults between the ages of 50 and 64 report that they would prefer to stay in their current residence as they age. A growing elderly population will be living independently in their own homes. It is vital to transform future homes into intelligent, human-centered environments for the elderly. These are golden opportunities for innovating assistive technologies for aging in place.
  48. A Basic Scenario of Tender Care for Aging-in-place. Unobtrusive sensing; social signal processing; context-aware auto tagging; social cognitive network. An unobtrusive sensing device detects that the elder keeps walking around at an irregular pace. Social signal processing indicates that the elder has been silent for an unusually long time. Cognitive analysis result: "Your mother may be feeling anxious now…" "I need to call my mother now…"
  49. Silver Challenges
  50. Vision. To enable the elderly to maintain an active, healthy, and engaging lifestyle in their own homes, supported by an age-friendly intelligent environment providing all-round comprehensive tender care: round-the-clock day-to-day health and wellness monitoring; cognitive support and recommendation of products and services; companionship and emotional support; support for maintaining/stimulating social interaction.
  51. Design Considerations and Challenges. How to perform unobtrusive monitoring? – mobile sensing, activity tracking. How to provide all-around comprehensive care? – physical, cognitive, emotional, social, sustainability. How to maintain ubiquitous access and interaction? – cross-platform, multimedia, multimodal. How to provide a friendly, personal touch? – adaptive user modeling, mood detection, and proactive, natural interaction.
  52. Approach and Methodology. To support active living of the elderly through an intelligent multi-agent environment with ubiquitous access, a natural interface, and all-rounded comprehensive care. Key technologies: unobtrusive sensing and social signal processing; activity pattern and user modeling; information and service recommendation; proactive stimulation and natural interaction.
  53. A Multi-Agent Collaborative Care Environment. Isabel (Personal Nurse) – small talk; recommendations for healthcare products and services. Alfred (The Butler) – small talk; user modeling; social and travel advisory. Frank (Robot Dog) – activity sensing; pattern modeling.
  54. Why Multi-Agent? Unobtrusive sensing and monitoring – agents of different characteristics and capabilities. Ubiquitous access to information and services – agents on different platforms and in different locations. Comprehensive tender care – agents with different domain knowledge and functions. "Three's a party" – more opportunities for cognitive stimulation and social interaction.
  55. Comprehensive Tender Care. Physical support – activity tracking, safety and wellness monitoring. Cognitive support – information and recommendation on (healthcare) products, services, skills, and activities. Emotional support – mood detection, affective support, small talk. Social support – companionship and connection to family and friends (old and new) through SMS, emails, Facebook, etc.
  56. Unobtrusive Sensing and Ubiquitous Access to Services. Unobtrusive in-home real-time data collection and contextual social signal processing are essential to better understand and cater to the elderly's needs. Sensing – bio-sensing, motion sensors, wearable/mobile sensors for health monitoring and activity tracking. Cross-platform – large-screen interactive displays, mobile handheld devices, physical robots. Multimedia – text, audio, video.
  57. Adaptive User Modelling. Identity and profile; interests and preferences; behaviour model: time, space, activity; knowledge and skills; social network: family and friends. Methods for model building – explicit: user specification; implicit: user actions, choices, conversation. (A hypothetical data-structure sketch follows below.)
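To make the user-model facets concrete, here is a hypothetical container in Python; the field names and the simple implicit-update rule are illustrative assumptions, not part of the LILY system.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ElderlyUserModel:
    """Hypothetical container for the user-model facets listed on the slide."""
    profile: Dict[str, str] = field(default_factory=dict)        # identity and profile
    preferences: Dict[str, float] = field(default_factory=dict)  # interests and preferences (0..1 scores)
    behaviour: List[Dict] = field(default_factory=list)          # (time, space, activity) records
    skills: List[str] = field(default_factory=list)              # knowledge and skills
    social_network: List[str] = field(default_factory=list)      # family and friends

    def record_choice(self, item: str, liked: bool, lr: float = 0.1):
        """Implicit model building: nudge a preference score towards an observed choice."""
        score = self.preferences.get(item, 0.5)
        target = 1.0 if liked else 0.0
        self.preferences[item] = score + lr * (target - score)

# Explicit specification and implicit updates can coexist:
model = ElderlyUserModel(profile={"name": "Mrs. Lim"})
model.record_choice("tai chi video", liked=True)
print(model.preferences)  # {'tai chi video': 0.55}
```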
  58. Cognitive Support: Product/Service Recommendation. Domain knowledge: healthcare, travel, cooking. Delivery modes: question & answer; proactive recommendation; conversation. Personal touch: personalized, context-sensitive, small talk.
  59. Challenges in Big Living Analytics. Volume – huge amounts of data through bio-sensing, motion sensors, wearable/mobile sensors for health monitoring and activity tracking. Velocity – 24x7 real-time sensing, sense making, decision making, service recommendation. Variety – information integration and knowledge sharing from cross-platform, multimedia, unstructured data: text, audio, video, gestures.
  60. Research Centre of Excellence in Active LIving for the elderLY (LILY). Thank you! JOINT UBC-NTU RESEARCH CENTRE
