Invited talk at the National Institute of Astronomy and Geophysics, Helwan, on Wed. 21 October 2014: Data is the new oil: Big data, data mining and bio-inspiring techniques
1. Professor Aboul Ella Hassanien
Chair of the Scientific Research Group in Egypt (SRGE)
http://www.egyptscience.net
Dean of the Faculty of Computers and Information, Beni-Suef University
Web site: http://www.fci.cu.edu.eg/~abo/
Facebook: http://www.facebook.com/profile.php?id=100000780092307
ResearchGate: http://www.researchgate.net/home.Home.html
CU scholar: http://scholar.cu.edu.eg/abo
2. Agenda
• Scientific Research Group in Egypt (SRGE)
• SRGE Trends and Directions
▫ Information and Network Security (Geospatial Data 2D/3D)
▫ Biomedical Informatics (Biomedical/Bioinformatics)
▫ Intelligent Technology for blind and deaf people
▫ Intelligent Environment and Control System
▫ Data mining, graph mining and Social Networks (image and data registration/fusion in remote sensing)
• Big Data Set and Complex System
• Data Mining and Intelligent systems
• Open Discussion
4. Scientific Research Group in Egypt (SRGE)
Members
• 1 Professor
• 15 Assistant Professors
• 20 Ph.D students
• 25 M. Sc. students
• 50 international collaborating researchers from 15 countries
• 10 undergraduate students
[Figure: bar chart of SRGE member numbers]
20 Faculties and institutes
5. Scientific Research Group in Egypt (SRGE)
Objective
• To encourage young Egyptian researchers to cooperate, to make that cooperation easy,
and to increase their contribution to academic research.
• To integrate the various research efforts of the scientific team to be a source
of innovation on possible scientific, technological and socio-economic
trajectories to mould the future of machine intelligence technologies and
applications.
• To produce Master/PhD graduates:
▫ Who can conduct high quality academic research,
▫ Who can publish their research in high quality academic journals,
▫ Who can obtain tenure track faculty positions at high ranking research universities,
▫ Who are good teachers, and more generally who are good academics
7. Scientific Research Group in Egypt (SRGE)
Publications (2013-2014)
• 2013: more than 100 publications
▫ 32 (ISI) journal papers (Elsevier, Springer and other prestigious journals)
▫ 60 international conference papers (IEEE/Springer)
▫ Book chapters: 10 (Springer)
▫ Edited books: five (Springer)
▫ Edited proceedings: one
▫ Special issues: three
8. Scientific Research Group in Egypt (SRGE)
SRGE research tracks (2013-2014)
• Track-(1) Network and information security
• Track-(2) Biomedical eng. & Bioinformatics
• Track-(3) Intelligent environment and applications
• Track-(4) Intelligent technology for disabled people
• Track-(5) Chem(o)informatics
• Track-(6) Social networks / Big Data and graph mining
9. Scientific Research Group in Egypt (SRGE)
Research tracks
• Track-I Network and Information Security
▫ Intrusion Detection System
(Machine Intelligence, Danger Theory, AIS)
▫ Cryptanalysis (Evolutionary optimization)
▫ Image Authentication and Applications
▫ Watermarking (vector and raster data)
▫ Digital Signatures
▫ Biometrics
Heart sound recognition
Face and fingerprint
Gait processing
Problems:
- Heart Sound as a biometric
- Watermarking (Vector data)
- Image authentication
- Asymmetric hash function
- Multi-Biometric-based
11. Network and Information Security
Blind Source Separation (ICA)
Blind Source Separation (BSS) deals with the problem of separating
independent sources using only their observed mixtures, when both
the mixing process and the original sources are unknown.
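
A minimal sketch of BSS for two mixtures, assuming the sources are statistically independent and non-Gaussian (the usual ICA assumptions). The two signals, the mixing weights and the helper names (`whiten`, `separate`) are illustrative choices, not taken from the talk. The mixtures are whitened first, then a rotation angle is picked by grid search so that the outputs have the largest absolute excess kurtosis; practical work would more likely rely on an established ICA implementation.

```python
import math

def mean(values):
    return sum(values) / len(values)

def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    m = mean(values)
    var = mean([(v - m) ** 2 for v in values])
    return mean([(v - m) ** 4 for v in values]) / var ** 2 - 3.0

def whiten(x1, x2):
    """Center two mixtures, decorrelate them and scale each to unit variance."""
    m1, m2 = mean(x1), mean(x2)
    x1 = [p - m1 for p in x1]
    x2 = [q - m2 for q in x2]
    a = mean([p * p for p in x1])                # var(x1)
    d = mean([q * q for q in x2])                # var(x2)
    b = mean([p * q for p, q in zip(x1, x2)])    # cov(x1, x2)
    phi = 0.5 * math.atan2(2.0 * b, a - d)       # principal-axis angle
    c, s = math.cos(phi), math.sin(phi)
    u = [c * p + s * q for p, q in zip(x1, x2)]
    v = [-s * p + c * q for p, q in zip(x1, x2)]
    su = math.sqrt(mean([p * p for p in u]))
    sv = math.sqrt(mean([q * q for q in v]))
    return [p / su for p in u], [q / sv for q in v]

def separate(x1, x2, steps=180):
    """Return two estimated sources (recovered up to order, sign and scale)."""
    z1, z2 = whiten(x1, x2)
    best_score, best_pair = -1.0, (z1, z2)
    for k in range(steps):
        theta = (math.pi / 2.0) * k / steps
        c, s = math.cos(theta), math.sin(theta)
        y1 = [c * p + s * q for p, q in zip(z1, z2)]
        y2 = [-s * p + c * q for p, q in zip(z1, z2)]
        score = abs(excess_kurtosis(y1)) + abs(excess_kurtosis(y2))
        if score > best_score:
            best_score, best_pair = score, (y1, y2)
    return best_pair

# Two made-up sources: a sine wave and a sawtooth.
n = 2000
s1 = [math.sin(2.0 * math.pi * t / 50.0) for t in range(n)]
s2 = [2.0 * (t % 37) / 37.0 - 1.0 for t in range(n)]
# Only these mixtures are "observed"; separate() never sees the weights.
x1 = [1.0 * a + 0.5 * b for a, b in zip(s1, s2)]
x2 = [0.6 * a + 1.0 * b for a, b in zip(s1, s2)]
y1, y2 = separate(x1, x2)
```

Each output should closely follow one of the two sources, although which output matches which source, and with what sign, is not determined by the method.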
[Figures: blind separation of information from galaxy spectra; early diagnosis of pathology in fetus]
12. Network and Information Security
Vector geospatial data / 3D animated watermarking
[Figure panels: geospatial data; 3D animated object]
• Geospatial data or geographic
information is the data or information
that identifies the geographic location
of features and boundaries on Earth,
such as natural or constructed features,
oceans, and more.
13. Scientific Research Group in Egypt (SRGE)
Research tracks
• Track-II Biomedical Informatics
▫ Medical image processing
Breast cancer analysis (sonar, MRI, fMRI, CT)
Liver fibrosis and tumour analysis (biopsy, MRI, CT)
Medical image annotation
• Bioinformatics
Problems:
- Breast Cancer Case
- Liver Fibrosis – HCV
- Content-based image retrieval
- Formal Concept Analysis (visualize (rule based))
14. Track-II Biomedical Informatics
Hepatitis C Virus (HCV) in Egypt
The World Health Organization has declared hepatitis C a global health
problem, with approximately 3% of the world’s population
(roughly 170-200 million people) infected with HCV.
Egypt has one of the highest prevalence rates of hepatitis C
in the world; the situation there is considerably worse.
EGYPT: 14.7% infected with hepatitis C
15. Track-II Biomedical Informatics
Liver Fibrosis
• Stage 0: No fibrosis (fatty liver)
• Stage 1: Portal expansion with fibrosis (<1/3 area)
• Stage 2: Bridging fibrosis (>1/3)
• Stage 3: Marked bridging fibrosis or early cirrhosis (no reason for tissue conversion)
• Stage 4: Definite cirrhosis (<50% of biopsy fibrosis)
• Stage 5: Definite cirrhosis (>50% of biopsy fibrosis)
Challenges: distinguishing between the late fibrosis stage and a tumor
Good segmentation techniques / feature-based methods / classifiers
16. Scientific Research Group in Egypt (SRGE)
Research tracks
• Track-III: Intelligent Environment
▫ Intelligent Water/Air Quality Monitoring
▫ Smart Reading Environments
▫ Intelligent Lighting System
▫ Video Processing (video annotation/summarization)
Problems:
- Monitoring water/air pollution
- Climate Change
18. Intelligent environment and applications track
Cattle identification
• Identify the origin of each animal;
• Trace the path of each animal from
location to location;
• Trace each animal exposed to disease;
• Eradicate or control an animal health
threat;
• Retrieve information within hours of an
outbreak and implement intervention
strategies;
• Enhance the safety and security of the
food chain;
• Improve consumer confidence; and,
• Facilitate efficient market transactions, as
it provides assurance to buyers regarding
the animal's life history.
19. Arabian horse
Track-III Intelligent environment and applications
Arabian Horse identification using Iris pattern
The Arabian horse is
a breed of horse that
originated on the Arabian
Peninsula. It is one of the
oldest breeds, dating back
4,500 years.
Recent developments in iris
scanning have led to a new form of
equine identification, and research
has indicated that the horse's eye
could be the most telling identifier.
21. Scientific Research Group in Egypt (SRGE)
Research tracks
• Track-IV: Intelligent technology
for blind and visually impaired people
▫ Text to speech processing
▫ Document management for blind and visually
impaired people
▫ Developing games for blind and visually
impaired people
▫ Mobile applications for blind and visually
impaired people
▫ Automatic Sign Language (ASL) Recognition
for deaf-blind people
23. Tongue Drive System
• For disabled people, technology may do more
than just improve their lives: high-tech tools may
give them their lives back.
• Researchers from the Georgia Institute of Technology
created the latest device, a mouth retainer that allows
people with spinal cord injuries to operate a computer
and move an electric wheelchair with only their tongues.
24. Thought-Controlled Wheelchair
• Users wear a cap that can read brain signals. Those signals
are relayed to an electroencephalograph (EEG) on the
wheelchair, then analyzed by a computer program whose
output is sent to the wheelchair.
• Toyota said its next goal is to allow users to
think about letters in order to spell words.
Analysis of brain signals?
25. Big Data in Complex System
Data is the new oil
27. Simple to start
• What is the maximum file size you
have dealt with so far?
▫ Movies/files/streaming video that
you have used?
▫ What have you observed?
• What is the maximum download
speed you get?
• Simple computation
▫ How much time does it take just to transfer the data?
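
The "simple computation" can be sketched as follows. The file size and link speed are hypothetical numbers chosen for illustration, and the formula assumes an ideal link with no protocol overhead or congestion.

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Ideal transfer time: size in bits divided by link speed."""
    return size_bytes * 8 / link_bits_per_second

# Hypothetical case: a 1 TB data set over a 10 Mbit/s connection.
seconds = transfer_seconds(1e12, 10e6)
print(f"{seconds / 3600:.1f} hours, about {seconds / 86400:.1f} days")
```

Under these assumptions the transfer alone takes more than nine days, before any processing starts.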
28. What is big data?
• 90% of the data in the world today has
been created in the last two years alone.
• This data comes from everywhere:
▫ sensors used to gather climate
information,
▫ posts to social media sites,
▫ digital pictures and videos,
▫ cell phone GPS signals, to name a few.
This data is “big data.”
29. Big Data Born
• Google, eBay, LinkedIn, and
Facebook were built around
Big Data from the beginning.
• No need to integrate Big Data
with more traditional sources
of data and the analytics
performed upon them
• No merging Big Data
technologies with their
traditional IT infrastructures
• Big Data could stand alone,
Big Data analytics could be the
only focus of analytics
30. What is Big Data?
• Big Data is a term applied to data
sets whose size is beyond the ability
of commonly used software tools to
capture, manage, and process the
data within a tolerable elapsed time.
• Big Data sizes are a constantly
moving target, currently ranging
from a few dozen terabytes to
many petabytes of data in a single
data set. (Wikipedia, October 2014,
http://en.wikipedia.org/wiki/Big_data)
31. Huge amount of data
• There are huge volumes of data in the
world:
▫ From the beginning of recorded time
until 2003, we created 5 billion gigabytes
(5 exabytes) of data.
▫ In 2011, the same amount was
created every two days.
▫ In 2013, the same amount of data was
created every 10 minutes.
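
A short sketch of what the figures above imply as daily rates. It only rearranges the numbers quoted on the slide; the function name is an illustrative choice.

```python
EXABYTES_UNTIL_2003 = 5.0  # 5 billion gigabytes = 5 exabytes (figure from the slide)

def exabytes_per_day(interval_minutes):
    """Daily volume if 5 EB are created once per interval."""
    intervals_per_day = 24 * 60 / interval_minutes
    return EXABYTES_UNTIL_2003 * intervals_per_day

rate_2011 = exabytes_per_day(2 * 24 * 60)  # 5 EB every two days
rate_2013 = exabytes_per_day(10)           # 5 EB every 10 minutes
print(rate_2011, rate_2013, rate_2013 / rate_2011)
```

Taken at face value, the quoted figures correspond to 2.5 EB per day in 2011 and 720 EB per day in 2013, a 288-fold increase.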
32. How much data?
• Google processes 20 PB a day
(2008)
• Wayback Machine has 3 PB + 100
TB/month (3/2009)
• Facebook has 2.5 PB of user data +
15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50
TB/day (5/2009)
33. Big data spans three dimensions:
Volume, Velocity and Variety
34. Big data spans three dimensions: Volume,
Velocity and Variety
• Volume:
▫ Enterprises are awash with ever-growing data of
all types, easily amassing terabytes—even
petabytes—of information.
Turn 12 terabytes of Tweets created each day
into improved product sentiment analysis
35. Big data spans three dimensions: Volume,
Velocity and Variety
• Velocity:
▫ Sometimes 2 minutes is too late. For time-sensitive processes
such as catching fraud, big data must be used as it streams into
your enterprise in order to maximize its value.
▫ Analyze 500 million daily call detail records in real time to
predict customer churn faster
The latest I have heard is that even a 10-nanosecond delay
is too much.
36. Big data spans three dimensions: Volume,
Velocity and Variety
• Variety:
▫ Big data is any type of data - structured and unstructured data
such as text, sensor data, audio, video, click streams, log files
and more. New insights are found when analyzing these data
types together.
Monitor hundreds of live video feeds from surveillance cameras
to target points of interest
Exploit the 80% data growth in images, video and documents
to improve customer satisfaction
37. Time for thinking
• What do you do with the data?
▫ Let's take an example:
“From application developers to video streamers, organizations
of all sizes face the challenge of capturing, searching, analyzing,
and leveraging as much as terabytes of data per second—too
much for the constraints of traditional system capabilities and
database management tools.”
38. Finally….
‘Big data’ is similar to ‘small data’, but bigger.
But bigger data requires different approaches:
techniques, tools, architecture
… with the aim of solving new problems,
or old problems in a better way.
39. What to do with these data?
• Aggregation and Statistics
▫ Data warehouse and OLAP
• Indexing, Searching, and Querying
▫ Keyword based search
▫ Pattern matching (XML/RDF)
• Knowledge discovery
▫ Data Mining
▫ Statistical Modeling
41. What is Data mining?
• Data mining (knowledge discovery from data)
▫ Extraction of interesting (non-trivial, implicit, previously
unknown and potentially useful) patterns or knowledge
from huge amounts of data
• Alternative names
▫ Knowledge discovery (mining) in databases (KDD),
knowledge extraction, business intelligence, etc.
42. Why Not Traditional Data Analysis?
• Huge amount of data
▫ Algorithms must be highly scalable to handle terabytes of data
• High-dimensionality of data
▫ Micro-array may have tens of thousands of dimensions
• High complexity of data
▫ Data streams and sensor data
▫ Time-series data, temporal data, sequence data
▫ Structured data, graphs, social networks and multi-linked data
▫ Heterogeneous databases and legacy databases
▫ Spatial, spatiotemporal, multimedia, text and Web data
▫ Software programs, scientific simulations
43. Multi-Dimensional View of Data Mining
• Data to be mined
▫ Relational, data warehouse, transactional, stream, object-oriented/
relational, active, spatial, time-series, text, multi-media,
heterogeneous, legacy, WWW
• Knowledge to be mined
▫ Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
▫ Multiple/integrated functions and mining at multiple levels
• Techniques utilized
▫ Database-oriented, data warehouse (OLAP), machine learning, statistics,
visualization, etc.
• Applications adapted
▫ Retail, telecommunication, banking, fraud analysis, bio-data mining, stock
market analysis, text mining, Web mining, etc.
44. Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
▫ Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
▫ Data streams and sensor data
▫ Time-series data, temporal data, sequence data (incl. bio-sequences)
▫ Structured data, graphs, social networks and multi-linked data
▫ Object-relational databases
▫ Heterogeneous databases and legacy databases
▫ Spatial data and spatiotemporal data
▫ Multimedia database
▫ Text databases
▫ The World-Wide Web
45. Data Mining Functionalities
• Multidimensional concept description: Characterization and discrimination
▫ Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet
regions
• Frequent patterns, association, correlation vs. causality
▫ Diaper → Beer [support 0.5%, confidence 75%] (Correlation or causality?)
• Classification and prediction
▫ Construct models (functions) that describe and distinguish classes or
concepts for future prediction
E.g., classify countries based on (climate), or classify cars based on (gas
mileage)
▫ Predict some unknown or missing numerical values
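
The two numbers attached to an association rule such as the diaper and beer example are its support and its confidence. A minimal sketch, using made-up market baskets (the data and the function names are illustrative, not from the talk):

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Made-up market baskets, for illustration only.
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "beer", "bread"},
    {"diaper", "milk"},
    {"milk", "bread"},
    {"bread"},
    {"beer", "bread"},
    {"milk"},
]
rule_support = support(baskets, {"diaper", "beer"})          # 3 of 8 baskets
rule_confidence = confidence(baskets, {"diaper"}, {"beer"})  # 3 of 4 diaper baskets
```

A high confidence only says that the items co-occur; it does not by itself show that buying one causes buying the other.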
46. Data Mining Functionalities (2)
• Cluster analysis
▫ Class label is unknown: Group data to form new classes, e.g., cluster houses
to find distribution patterns
▫ Maximizing intra-class similarity & minimizing interclass similarity
• Outlier analysis
▫ Outlier: Data object that does not comply with the general behavior of the
data
▫ Noise or exception? Useful in fraud detection, rare events analysis
• Trend and evolution analysis
▫ Trend and deviation: e.g., regression analysis
▫ Periodicity analysis
▫ Similarity-based analysis
• Other pattern-directed or statistical analyses
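
One simple way to flag an outlier, as described above, is a z-score test: mark values that lie far from the mean in units of standard deviation. This is only one of many possible techniques; the threshold of 2 and the sample amounts are assumptions for illustration.

```python
import math

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return []  # all values identical: nothing stands out
    return [v for v in values if abs(v - mean) / std > threshold]

# Made-up daily transaction amounts with one suspicious entry.
amounts = [10, 11, 9, 10, 12, 10, 50]
suspicious = zscore_outliers(amounts)
```

Whether a flagged value is noise or a meaningful exception (for example fraud) still has to be decided from the application context.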
47. Data Mining Functionalities (1)
Basic Data Mining Tasks
• Classification maps data into predefined
groups or classes
▫ Supervised learning
▫ Pattern recognition
▫ Prediction
• Clustering groups similar data together into
clusters.
▫ Unsupervised learning
▫ Segmentation
▫ Partitioning
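
Clustering, the unsupervised task above, can be illustrated with k-means (Lloyd's algorithm), one common partitioning method. This is a minimal sketch: the points and the initial centroids are made up, and passing the initial centroids explicitly is a simplification chosen to keep the example deterministic.

```python
import math

def kmeans(points, centroids, n_iters=10):
    """Lloyd's algorithm: alternate an assignment step and a centroid update step."""
    centroids = [tuple(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Two visibly separated groups of made-up 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
```

No class labels are supplied: the grouping comes only from distances between points, which is what distinguishes clustering from classification.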
48. Data Mining Functionalities (2)
Basic Data Mining Tasks
• Summarization maps data into subsets with
associated simple descriptions.
▫ Characterization
▫ Generalization
• Link Analysis uncovers relationships among
data.
▫ Affinity Analysis
▫ Association Rules
▫ Sequential Analysis determines sequential patterns.
49. Architecture: Typical Data Mining System
[Diagram, layers from top to bottom]
• Graphical User Interface
• Pattern Evaluation
• Data Mining Engine
• Database or Data Warehouse Server
• Data cleaning, integration, and selection
• Data sources: Database, Data Warehouse, World-Wide Web, Other Info Repositories
• Knowledge-Base (side component)
53. Example: Information Retrieval
• Information Retrieval (IR): retrieving desired
information from textual data.
• Library Science
• Digital Libraries
• Web Search Engines
• Traditionally keyword based
• Sample query:
Find all documents about “data mining”.
DM: Similarity measures;
Mine text/Web data.
54. Information Retrieval (cont’d)
• Similarity: measure of how close a query is
to a document.
• Documents which are “close enough” are
retrieved.
• Metrics:
▫ Precision = |Relevant and Retrieved| / |Retrieved|
▫ Recall = |Relevant and Retrieved| / |Relevant|
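
The two metrics can be computed directly from the sets of retrieved and relevant documents. A minimal sketch with hypothetical document ids:

```python
def precision_recall(retrieved, relevant):
    """Precision and recall from two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant AND retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical result of the sample query about "data mining".
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d5", "d6", "d7"]
precision, recall = precision_recall(retrieved, relevant)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4).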
55. Intelligent Systems
Bio-inspired systems
Biologically inspired computing
relies heavily on the fields of
biology, computer science and
mathematics.
Recommender system
56. Artificial Immune System (AIS)
• AIS are adaptive systems, inspired by theoretical
immunology and observed immune functions,
principles and models, which are applied to
problem solving
• Applications
▫ Bioinformatics
▫ Intrusion detection
▫ Virus detection
57. Swarm Intelligence
Definition:
- Swarm intelligence (SI) is an artificial intelligence technique based on the study of
collective behavior in decentralized, self-organized systems
- SI systems are typically made up of a population of simple agents
interacting locally with one another and with their environment.
Goals:
-performance optimization and robustness
-self-organized control and cooperation (decentralized)
-division of labour and distributed task allocation
58. Swarm Intelligence Techniques
• Ant Colony Optimization (ACO)
• Marriage in Honey Bees Optimization (MBO)
• Particle Swarm Optimization (PSO).
• Fish swarm (schooling)
59. Ant Colony Optimization
• Ant Colony Optimization is an
efficient method for finding good,
often near-optimal, solutions to
problems posed on a graph
• Using three steps (choosing the
next city, updating pheromone
trails, and pheromone trail decay),
we can approximate an optimal
tour of the graph
• Ant Colony Optimization has
been used to find solutions
to real-world problems, such as
truck routing
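
The three steps named above (city choice, pheromone update, pheromone decay) can be sketched for a small travelling-salesman instance. This is a simplified ant-system variant under stated assumptions: the parameter names and values (`alpha`, `beta`, `rho`, `q`), the fixed random seed and the four-city example are illustrative and not taken from the talk.

```python
import math
import random

def tour_length(tour, dist):
    """Length of the closed tour that returns to its starting city."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def ant_colony_tsp(points, n_ants=10, n_iters=30, alpha=1.0, beta=2.0,
                   rho=0.5, q=1.0, seed=0):
    rng = random.Random(seed)
    n = len(points)
    dist = [[math.dist(a, b) for b in points] for a in points]
    pher = [[1.0] * n for _ in range(n)]  # pheromone on every edge
    best_tour, best_len = None, float("inf")
    for _ in range(n_iters):
        tours = []
        for _ in range(n_ants):
            tour = [rng.randrange(n)]
            unvisited = set(range(n)) - {tour[0]}
            while unvisited:
                i = tour[-1]
                cand = sorted(unvisited)
                # Step 1, city choice: favor strong pheromone and short edges.
                weights = [pher[i][j] ** alpha * (1.0 / dist[i][j]) ** beta
                           for j in cand]
                nxt = rng.choices(cand, weights=weights)[0]
                tour.append(nxt)
                unvisited.remove(nxt)
            tours.append(tour)
        # Step 3, decay: evaporate a fraction rho of all pheromone.
        for i in range(n):
            for j in range(n):
                pher[i][j] *= 1.0 - rho
        # Step 2, update: each ant deposits pheromone, more for shorter tours.
        for tour in tours:
            length = tour_length(tour, dist)
            if length < best_len:
                best_tour, best_len = tour, length
            for k in range(n):
                a, b = tour[k], tour[(k + 1) % n]
                pher[a][b] += q / length
                pher[b][a] += q / length
    return best_tour, best_len

# Four cities on the corners of a unit square (made-up example).
cities = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
best_tour, best_len = ant_colony_tsp(cities)
```

The method is a heuristic: on larger instances it usually returns a good tour, but it does not guarantee the shortest one.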
61. Ant Colony Optimization Cont.
• Many difficult optimization problems have been solved
by so-called ant algorithms such as
- The Traveling Salesman Problem.
- The Quadratic Assignment Problem
- Other hard optimization problems.
• These different approaches all try to take advantage of
how social insects seem to function.
63. Marriage in Honey Bees Optimization Cont.
The main processes in MBO are:
(1) the mating flight of the queen bee with drones
(2) the creation of new broods by the queen bee
(3) the improvement of the broods' fitness by workers.
(4) the adaptation of the workers' fitness
(5) the replacement of the least fit queen(s) with the fittest brood(s).
64. Particle Swarm Optimization (PSO)
• The PSO method is motivated by the simulation of the social
behavior of bird flocking and fish schooling
65. Particle Swarm Optimization Cont.
• In PSO, each single solution is a "bird" in the search
space. We call it "particle".
• All particles have
▫ fitness values which are evaluated by the fitness
function to be optimized, and
▫ velocities which direct the flying of the particles.
• The particles fly through the problem space by
following the current optimum particles.
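
A minimal sketch of the particle update described above, in the common global-best form of PSO. The inertia weight `w`, the acceleration coefficients `c1` and `c2`, the search bounds and the sphere test function are assumptions chosen for illustration.

```python
import random

def pso(f, dim, n_particles=30, n_iters=200, lo=-5.0, hi=5.0,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize f over `dim` dimensions; returns (best position, best value)."""
    rng = random.Random(seed)
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]            # best position seen by each particle
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # best position seen by the swarm
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity: inertia + pull toward own best + pull toward swarm best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])                # fitness of the new position
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

def sphere(x):
    """Simple test function with its minimum value 0 at the origin."""
    return sum(v * v for v in x)

best_pos, best_val = pso(sphere, dim=2)
```

The particles "follow the current optimum" through the two attraction terms in the velocity update; the fitness function is the only problem-specific part.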
66. Swarm Intelligence Applications
• Swarm Robotics
• Crowd simulation
• Ant-based routing
• Telecommunication (routing and congestion
problems, intrusion detection)
• Computer Animation
• Electronics
• Data Mining
• Production control
• Industrial Design
67. Swarm robotics (e.g.: Swarm-bots)
• Collective task completion
• No need for overly complex algorithms
• Adaptable to changing environment
68. Communication Networks
• Routing packets to
destination in shortest time
• Similar to Shortest Route
• Statistics kept from prior
routing (learning from
experience)