Konstantinos Christofilos
OK. What is the problem?
In order to get data from each service, you have to speak its language (API).
What can we do about that?
We can create a repository of data from mixed services and query it to
produce more complex results.
How are we going to do that?
Step 1 – Generate endpoints
Step 2 – Load data
Step 1 – Generate endpoints
Importer interface
Example (Facebook page):
The command that generates endpoints takes as input a text file with
one name per line.
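The slide's code screenshot is not available in this text, so the following is only a minimal sketch of what an importer interface and the endpoint-generation step could look like. The class names, method names and the example.invalid URL are assumptions made for illustration; they are not taken from the application, and the fake importer makes no network calls.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List


class Importer(ABC):
    """Hypothetical importer interface: one implementation per service."""

    @abstractmethod
    def search_accounts(self, name: str) -> List[str]:
        """Return account identifiers on the service related to `name`."""

    @abstractmethod
    def endpoint_for(self, account_id: str) -> str:
        """Return the API endpoint used to fetch data for one account."""


class FakePageImporter(Importer):
    """Stand-in importer for illustration only; it never touches the network."""

    def search_accounts(self, name: str) -> List[str]:
        slug = name.lower().replace(" ", "")
        return [slug, slug + "official"]

    def endpoint_for(self, account_id: str) -> str:
        return "https://api.example.invalid/pages/" + account_id


def read_names(lines: Iterable[str]) -> List[str]:
    """Parse the input file: one name per line, blank lines ignored."""
    return [line.strip() for line in lines if line.strip()]


def generate_endpoints(names: Iterable[str], importers: Iterable[Importer]) -> List[str]:
    """Ask every importer for accounts related to every name."""
    importers = list(importers)
    endpoints = []
    for name in names:
        for importer in importers:
            for account in importer.search_accounts(name):
                endpoints.append(importer.endpoint_for(account))
    return endpoints
```

With a real file, `read_names(open("names.txt"))` would supply the names; adding a new service then means adding one more `Importer` implementation.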
Step 2 – Load data
Input
Output
Step 2 – Load data (Input)
Example (Facebook page)
Step 2 – Load data (Output)
Neo4j
Apache Jena
Example
A simple example running three names: Kostas Christofilos, Iannis Kotidis, Vasilis Spiropoulos
Example
Neo4j database size: 240 KB
Apache Jena database size: 75.7 KB
Scaling?
To scale easily, the application is designed to handle
Inputs and Outputs as APIs.
That approach makes horizontal scaling possible.
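One way to read "Outputs as APIs" is that every storage engine sits behind the same small interface, so engines can be added or moved to other machines without touching the rest of the code. The sketch below is an assumption about that design, not the application's code: `Output`, `save_mention` and both in-memory classes are hypothetical stand-ins for Neo4j and Apache Jena writers.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List, Set, Tuple


class Output(ABC):
    """Hypothetical output interface: one implementation per storage engine."""

    @abstractmethod
    def save_mention(self, account: str, entity: str, kind: str) -> None:
        """Persist the fact that `account` mentioned `entity` of type `kind`."""


class EdgeListOutput(Output):
    """Stand-in for a property-graph engine: keeps (start, type, end) edges."""

    def __init__(self) -> None:
        self.edges: List[Tuple[str, str, str]] = []

    def save_mention(self, account: str, entity: str, kind: str) -> None:
        self.edges.append((account, "MENTIONS", entity))


class TripleSetOutput(Output):
    """Stand-in for a triple store: keeps a set of (subject, predicate, object)."""

    def __init__(self) -> None:
        self.triples: Set[Tuple[str, str, str]] = set()

    def save_mention(self, account: str, entity: str, kind: str) -> None:
        self.triples.add((account, "mentions", entity))
        self.triples.add((entity, "type", kind))


def save_everywhere(outputs: Iterable[Output], account: str, entity: str, kind: str) -> None:
    """Write the same mention to every configured output."""
    for output in outputs:
        output.save_mention(account, entity, kind)
```

The caller only sees `save_mention`, so the two stand-ins are free to store the same fact in different shapes.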
Process
Gather Data
• We took the names of the world's greatest leaders from Fortune magazine
http://fortune.com/worlds-greatest-leaders/
• The application queried the APIs for accounts related to these
names
• It built a list of endpoints for those names
• Data was fetched from those endpoints
• Entities were recognized
• Data was saved into two different databases (Apache Jena, Neo4j)
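The steps above can be sketched as a single loop in which each stage is passed in as a function. This is an illustration under stated assumptions: `run_pipeline` and the three `demo_*` stubs are hypothetical names, and the stubs return canned values instead of calling any social network API or entity recognizer.

```python
from typing import Callable, Iterable, List, Tuple

Entity = Tuple[str, str]  # (entity text, entity kind)


def run_pipeline(
    names: Iterable[str],
    find_endpoints: Callable[[str], List[str]],
    fetch: Callable[[str], List[str]],
    recognize: Callable[[str], List[Entity]],
    save: Callable[[str, str, str], None],
) -> int:
    """Names -> endpoints -> fetched texts -> entities -> storage."""
    saved = 0
    for name in names:
        for endpoint in find_endpoints(name):
            for text in fetch(endpoint):
                for entity, kind in recognize(text):
                    save(endpoint, entity, kind)
                    saved += 1
    return saved


# Canned stand-ins so the sketch runs without any external service.
def demo_find_endpoints(name: str) -> List[str]:
    return ["https://api.example.invalid/" + name.replace(" ", "_")]


def demo_fetch(endpoint: str) -> List[str]:
    return ["Post about Curry", "Post about nothing"]


def demo_recognize(text: str) -> List[Entity]:
    return [("Curry", "PERSON")] if "Curry" in text else []
```

Passing `save` as a function keeps the loop independent of whether the data ends up in one database or several.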
Process
The names that were used are the following:
Jeff Bezos, Angela Merkel, Aung San Suu Kyi, Pope Francis, Tim Cook,
John Legend, Christina Figueres, Paul Ryan, Ruth Bader Ginsburg,
Sheikh Hasina, Nick Saban, Huateng "Pony" Ma, Sergio Moro, Bono,
Stephen Curry, Steve Kerr, Bryan Stevenson, Nikki Haley, Lin-Manuel
Miranda, Marvin Ellison, Reshma Saujani, Larry Fink, Scott Kelly,
Mikhail Kornienko, David Miliband, Anna Maria Chavez, Carla Hayden,
Maurizio Macri, Alicia Garza, Patrisse Cullors, Opal Tometi, Chai Jing,
Moncef Slaoui, John Oliver, Marc Edwards, Arthur Brooks, Rosie Batty,
Kristen Griest, Shaye Haver, Denis Mukwege, Christine Lagarde, Marc
Benioff, Gina Raimondo, Amina Mohammed, Domenico Lucano, Melinda
Gates, Susan Desmond-Hellman, Arvind Kejriwal, Jorge Ramos, Michael
Froman, Mina Guli, Ramon Mendez, Bright Simons, Justin Trudeau,
Clare Rewcastle Brown, Tshering Tobgay
Process
The list of names above, once parsed by the application, produced
more than 11,000 endpoints on Facebook, Instagram and Twitter, with the
following distribution:

Names  Facebook Page Endpoints  Twitter Endpoints  Instagram Endpoints  Total Endpoints
56     437                      866                9903                 11206
Process
Parsing those endpoints on a single workstation* took about 14
days, from 2016-06-04 to 2016-06-18.
Most of the time was spent in the entity recognition process.
*Workstation specs: AMD FX 2-core CPU, 4 GB RAM, 120 GB SSD, Linux
OS, with 5 concurrent processes of the application running
Results
The application ran a single pass over the generated endpoints, which
took about 14 days on a single workstation and generated 126,395
nodes.

Endpoints  Machines  Generated Nodes  Time     Average nodes/day/machine
11206      1         126395           14 days  9028
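The reported figures can be checked with a few lines of arithmetic. The throughput function reproduces the average from the table; the `estimated_days` helper is an assumption added here for illustration, and it presumes throughput grows linearly with the number of machines, which the deck does not measure.

```python
# Figures reported in the deck.
FACEBOOK_PAGE_ENDPOINTS = 437
TWITTER_ENDPOINTS = 866
INSTAGRAM_ENDPOINTS = 9903
TOTAL_ENDPOINTS = FACEBOOK_PAGE_ENDPOINTS + TWITTER_ENDPOINTS + INSTAGRAM_ENDPOINTS

GENERATED_NODES = 126_395
DAYS = 14
MACHINES = 1


def nodes_per_day_per_machine(nodes: int, days: float, machines: int) -> float:
    """Average throughput of one machine."""
    return nodes / (days * machines)


def estimated_days(nodes: int, machines: int, rate_per_machine: float) -> float:
    """Hypothetical estimate, assuming perfectly linear horizontal scaling."""
    return nodes / (machines * rate_per_machine)


RATE = nodes_per_day_per_machine(GENERATED_NODES, DAYS, MACHINES)
```

`TOTAL_ENDPOINTS` evaluates to 11206 and `RATE` rounds to 9028, matching the tables. Under the linear assumption, two machines would need about 7 days for the same workload.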
Results
[Figure: Data import time distribution (empty database). Bar chart for Neo4j and Apache Jena showing the share of time spent in Import (%) and NER (%).]
Results
[Figure: Data import time distribution (100,000+ nodes). Bar chart for Neo4j and Apache Jena showing the share of time spent in Import (%) and NER (%).]
Results
[Figure: Person mentions. Bar chart of mention counts (axis 0 to 25,000) for: Jesus, Lebron, Stephen, Steph, Curry.]
Results
[Figure: Organization mentions. Bar chart of mention counts (axis 0 to 8,000) for: Padre, GSW, Santo, Cavs, NBA.]
Results
[Figure: Location mentions. Bar chart of mention counts (axis 0 to 4,000) for: Jordan, America, Hermosa, Venezuela, Cleveland.]
Conclusion
• Data are generated in vast amounts every moment.
• We created an approach for linking heterogeneous data in a single
repository.
• The generated data can be analyzed to produce combined results.
• Patterns can be identified from that repository.
Future extensions
• New Inputs can be implemented (new APIs)
• New Outputs can be implemented (new storage engines)
• The name list can be stored in a database and accessed from all nodes;
currently it is a local file
• A queue for the entity recognizer can be implemented to remove the
blocking code
• Centralized logging for cluster monitoring; currently logs go to STDOUT
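The queue extension could look roughly like the sketch below, which uses Python's standard `queue` and `threading` modules. It is an assumption about one possible design, not the application's code: `recognize` is a placeholder that simply picks capitalized words, and `run` and `ner_worker` are hypothetical names.

```python
import queue
import threading
from typing import List, Tuple


def recognize(text: str) -> List[str]:
    """Placeholder for the entity recognizer: picks capitalized words."""
    return [word for word in text.split() if word[:1].isupper()]


def ner_worker(tasks: "queue.Queue", results: list) -> None:
    """Take texts off the queue until a None sentinel arrives."""
    while True:
        text = tasks.get()
        if text is None:
            tasks.task_done()
            break
        results.append((text, recognize(text)))
        tasks.task_done()


def run(texts: List[str], workers: int = 2) -> List[Tuple[str, List[str]]]:
    """Fetching code only enqueues texts; recognition happens in the workers."""
    tasks: "queue.Queue" = queue.Queue()
    results: list = []
    threads = [
        threading.Thread(target=ner_worker, args=(tasks, results))
        for _ in range(workers)
    ]
    for thread in threads:
        thread.start()
    for text in texts:
        tasks.put(text)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results
```

The order of `results` depends on thread scheduling, so callers should not rely on it. In a cluster the in-process queue would be replaced by a shared message queue, which is outside the scope of this sketch.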
REST APIs
• REST stands for Representational State Transfer
• REST ignores the details of component implementation
• REST is an architectural style that enables applications to communicate
without knowing the underlying technology
• REST APIs promote easy reusability
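As a small illustration of the uniform interface, the sketch below builds (but does not send) an HTTP GET request for a resource using only the Python standard library. The base URL, the `pages/123` resource and the query parameters are hypothetical; no real service's API is implied.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_get_request(base_url: str, resource: str, params: dict) -> Request:
    """Build a GET request for a resource; the caller decides when to send it."""
    url = base_url.rstrip("/") + "/" + resource.lstrip("/")
    if params:
        url += "?" + urlencode(params)
    return Request(url, headers={"Accept": "application/json"}, method="GET")


# Hypothetical resource on a placeholder host.
request = build_get_request(
    "https://api.example.invalid/v1",
    "pages/123",
    {"fields": "name", "limit": 10},
)
```

The client only needs the URL, the HTTP method and the media type it accepts; how the server is implemented stays hidden.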
Graph Databases
• A graph database is a database type that uses graph structures with
nodes, edges and properties to represent and store data.
• Graph databases are based on graph theory
• Nodes represent entities such as people, businesses, accounts, or any
other item you might want to keep track of
• Edges, also known as relationships, are the lines that connect
nodes to other nodes and represent the relationships between them
Graph Databases
Resource Description Framework (RDF)
• RDF is a standard model for data interchange on the Web and was
specified by the W3C
• The Web is a graph made up of nodes, edges and relations
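RDF expresses a graph as a set of (subject, predicate, object) triples. The sketch below shows that idea with plain Python tuples and a simple pattern match; it does not use Apache Jena or any RDF library, and the `ex:` identifiers are made up for the example.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Made-up data: two accounts mention the same entity.
TRIPLES: Set[Triple] = {
    ("ex:account1", "ex:mentions", "ex:Curry"),
    ("ex:account2", "ex:mentions", "ex:Curry"),
    ("ex:account1", "ex:mentions", "ex:NBA"),
    ("ex:Curry", "rdf:type", "ex:Person"),
    ("ex:NBA", "rdf:type", "ex:Organization"),
}


def match(
    triples: Set[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )
```

For example, `match(TRIPLES, p="ex:mentions", o="ex:Curry")` returns the two triples whose subjects are the accounts that mention Curry.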
Graph Databases
Property Graphs
• Property graph databases are graph databases that contain connected
entities, which can hold any number of attributes
• Nodes can be tagged with labels representing their different roles
• Labels may also serve to attach metadata to certain nodes
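The property graph model described above can be sketched in memory as nodes with labels and properties, connected by typed edges that carry their own properties. This is an illustrative data structure only; it is not Neo4j's API, and all class and method names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set


@dataclass
class Node:
    node_id: str
    labels: Set[str] = field(default_factory=set)
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    start: str
    rel_type: str
    end: str
    properties: Dict[str, Any] = field(default_factory=dict)


class PropertyGraph:
    """Minimal in-memory property graph (illustration only)."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.edges: List[Edge] = []

    def add_node(self, node_id: str, labels: Set[str], **properties: Any) -> Node:
        """Create the node if needed, then merge in labels and properties."""
        node = self.nodes.setdefault(node_id, Node(node_id))
        node.labels.update(labels)
        node.properties.update(properties)
        return node

    def add_edge(self, start: str, rel_type: str, end: str, **properties: Any) -> None:
        self.edges.append(Edge(start, rel_type, end, dict(properties)))

    def with_label(self, label: str) -> List[str]:
        """Ids of all nodes tagged with `label`, sorted for stable output."""
        return sorted(n.node_id for n in self.nodes.values() if label in n.labels)

    def neighbours(self, node_id: str, rel_type: str) -> List[str]:
        """Ids reachable from `node_id` over edges of type `rel_type`."""
        return [e.end for e in self.edges if e.start == node_id and e.rel_type == rel_type]
```

A node can carry several labels at once (for example `{"Account", "Twitter"}`), which is how one node can play different roles.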
Natural Language Processing (NLP)
• Natural language processing (NLP) is an area of research and
application that explores how computers can be used to understand
and manipulate natural language text or speech to do useful things
• NLP draws on a number of disciplines: information sciences, linguistics,
mathematics, electrical and electronic engineering, artificial intelligence
and robotics, psychology, etc.
• Applications of NLP include a number of fields of studies, such as
machine translation, natural language text processing and
summarization, user interfaces, multilingual and cross language
information retrieval, speech recognition, artificial intelligence and
expert systems
Natural Language Processing (NLP)
Named Entity Recognition (NER)
Named entity recognition (NER) is a subtask of information extraction
that seeks to locate and classify named entities in text into pre-defined
categories such as the names of persons, organizations, locations,
expressions of time etc. NER systems use linguistic grammar-based
techniques as well as statistical models, i.e. machine learning.
We used the Stanford Named Entity Recognizer, created by the
Stanford Natural Language Processing Group.
http://nlp.stanford.edu/software/CRF-NER.shtml
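To make the locate-and-classify idea concrete, here is a toy tagger that looks tokens up in a small dictionary and counts mentions per category. It is deliberately not the Stanford recognizer, which uses a trained statistical model; the dictionary entries and function names are assumptions chosen for illustration.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Toy dictionary lookup; a real recognizer uses a trained model instead.
GAZETTEER: Dict[str, str] = {
    "Curry": "PERSON",
    "Lebron": "PERSON",
    "NBA": "ORGANIZATION",
    "Cleveland": "LOCATION",
}


def tag(text: str) -> List[Tuple[str, str]]:
    """Return (token, category) pairs for tokens found in the gazetteer."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return [(token, GAZETTEER[token]) for token in tokens if token in GAZETTEER]


def count_mentions(texts: Iterable[str]) -> Dict[str, Counter]:
    """Count how often each entity is mentioned, grouped by category."""
    counts: Dict[str, Counter] = {}
    for text in texts:
        for token, category in tag(text):
            counts.setdefault(category, Counter())[token] += 1
    return counts
```

Counting per category in this way is the kind of aggregation that sits behind mention charts for persons, organizations and locations.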

Repository for data crawled from multiple social networks
