4. Data Modeling and Queries
~Elasticsearch
1. Index
• A collection of documents with broadly similar characteristics
• Roughly corresponds to a ‘database’ in a relational database
2. Type
• A logical category/partition of an index; its semantics are entirely up to you
• Roughly corresponds to a ‘table’ in a relational database
3. Document
• The basic unit of information that can be indexed
• Roughly corresponds to a ‘row’ in a relational database
5. Data Modeling and Queries
stackover/questions: (index: stackover, type: questions, each row below is one document)
question_id  tags   answer_time (sec)  posted_at            Random_sq
231          Java   3010               2016_01_02_21_20_01  11_10
290          spark  7381               2016_01_02_22_09_01  11_28
341          Java   5611               2016_01_10_01_02_05  11_31
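As a rough illustration of the index/type/document mapping above, the sketch below turns the first table row into a JSON document addressed as index/type/id. The field names follow the table; the helper name `to_document`, the lower-cased `random_sq` key, and the use of `question_id` as the document id are assumptions made for illustration, not something the slides state. Mapping types were deprecated in later Elasticsearch releases, so this follows the slide's model rather than current practice.

```python
import json

INDEX = "stackover"      # index name from the slide
DOC_TYPE = "questions"   # type name from the slide


def to_document(row):
    """Return (path, JSON body) for one table row.

    Assumption: question_id doubles as the document id.
    """
    path = f"/{INDEX}/{DOC_TYPE}/{row['question_id']}"
    return path, json.dumps(row, sort_keys=True)


# First row of the table; random_sq is kept as an opaque string.
row = {
    "question_id": 231,
    "tags": "Java",
    "answer_time": 3010,  # seconds
    "posted_at": "2016_01_02_21_20_01",
    "random_sq": "11_10",
}

path, body = to_document(row)
print(path)
print(body)
```

The path and body are what a client would send when indexing the row; no Elasticsearch client library is needed to see the shape of the data.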
10. Data Modeling and Queries
stackover/questions:
• Probability that a question with a specific tag (such as ‘java’) is answered within 10 minutes
= number of questions tagged ‘java’ and answered within 10 minutes
/ total number of questions tagged ‘java’
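The ratio above can be sketched in a few lines of plain Python. This is only an in-memory illustration over the three sample rows from the table, not the query the project actually ran; the function name `answer_probability`, the case-insensitive tag match, and the `max_seconds` parameter are assumptions.

```python
def answer_probability(questions, tag, max_seconds=600):
    """Fraction of questions with `tag` answered within `max_seconds`.

    Returns None when no question carries the tag.
    Assumption: tags are compared case-insensitively.
    """
    tag = tag.lower()
    tagged = [q for q in questions if q["tags"].lower() == tag]
    if not tagged:
        return None
    fast = sum(1 for q in tagged if q["answer_time"] <= max_seconds)
    return fast / len(tagged)


# Sample rows from the slide (answer_time in seconds).
QUESTIONS = [
    {"question_id": 231, "tags": "Java", "answer_time": 3010},
    {"question_id": 290, "tags": "spark", "answer_time": 7381},
    {"question_id": 341, "tags": "Java", "answer_time": 5611},
]

print(answer_probability(QUESTIONS, "java"))        # 10-minute window
print(answer_probability(QUESTIONS, "java", 3600))  # 1-hour window, for contrast
```

None of the three sample questions was answered within 10 minutes, so the second call with a one-hour window is included only to show a non-zero ratio.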
11. Data Modeling and Queries
stackover/questions:
• Stratified sampling, with strata defined by:
~ tags
~ posted_at (month)
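A minimal sketch of stratified sampling over the two keys named above (tag and posting month) might look like the following. It assumes the month is the first seven characters of `posted_at` (for example `2016_01`), that each stratum contributes `ceil(fraction * size)` rows, and that a seeded `random.Random` is acceptable; the slides do not say how the sampling was actually implemented.

```python
import math
import random
from collections import defaultdict


def stratified_sample(questions, fraction, seed=0):
    """Sample roughly `fraction` of the rows from every (tag, month) stratum."""
    rng = random.Random(seed)

    # Group rows by (lower-cased tag, "YYYY_MM" prefix of posted_at).
    strata = defaultdict(list)
    for q in questions:
        month = q["posted_at"][:7]
        strata[(q["tags"].lower(), month)].append(q)

    sample = []
    for key in sorted(strata):
        rows = strata[key]
        # Assumption: round up so small strata are never dropped entirely.
        k = min(len(rows), math.ceil(fraction * len(rows)))
        sample.extend(rng.sample(rows, k))
    return sample


QUESTIONS = [
    {"question_id": 231, "tags": "Java", "posted_at": "2016_01_02_21_20_01"},
    {"question_id": 290, "tags": "spark", "posted_at": "2016_01_02_22_09_01"},
    {"question_id": 341, "tags": "Java", "posted_at": "2016_01_10_01_02_05"},
]

sample = stratified_sample(QUESTIONS, fraction=0.5)
print([q["question_id"] for q in sample])
```

Sampling per stratum keeps rare tags and quiet months represented, which a single uniform sample over all questions would not guarantee.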
17. Data Modeling and Queries
stackovergraph/userstags:
userid  tags
231     ["Java", "JVM"]
290     ["JVM", "Spark"]
341     ["Java", "sql", "JVM"]
Graph nodes so far: Java
Tag   num
Java  1
27. Data Modeling and Queries
stackovergraph/userstags:
userid  tags
231     ["Java", "JVM"]
290     ["JVM", "Spark"]
341     ["Java", "sql", "JVM"]
Tag    num
Java   2
JVM    3
Spark  1
sql    1
Edge          weight
Java - JVM    2
JVM - Spark   1
Java - sql    1
JVM - sql     1
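The graph construction walked through on these slides can be sketched as follows: each tag is a node whose count is the number of users holding it, and each pair of tags held by the same user adds 1 to the weight of the edge between them. The function name `build_tag_graph` and the lower-casing of tags are assumptions for illustration; the slides treat “java” and “Java” as the same tag, which is what the normalization reproduces.

```python
from collections import Counter
from itertools import combinations

# User-to-tags table from the slide.
USER_TAGS = {
    231: ["Java", "JVM"],
    290: ["JVM", "Spark"],
    341: ["java", "sql", "JVM"],
}


def build_tag_graph(user_tags):
    """Return (node_counts, edge_weights) for the tag co-occurrence graph."""
    node_counts = Counter()
    edge_weights = Counter()
    for tags in user_tags.values():
        # Normalize case and drop duplicates within one user.
        unique = sorted({t.lower() for t in tags})
        node_counts.update(unique)
        # Every unordered pair of this user's tags strengthens one edge.
        edge_weights.update(combinations(unique, 2))
    return node_counts, edge_weights


node_counts, edge_weights = build_tag_graph(USER_TAGS)
print(dict(node_counts))
print(dict(edge_weights))
```

Edge keys are stored as alphabetically sorted tuples so that (java, jvm) and (jvm, java) are counted as the same edge.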
28. Data Modeling and Queries
Recommend tags for users:
Tag    num
Java   2
JVM    3
Spark  1
sql    1
Edge          weight
Java - JVM    2
JVM - Spark   1
Java - sql    1
JVM - sql     1
Proportion of people who can answer “B” questions among people who can answer “A” questions
= weight of edge AB / number of people who have answered “A” questions
= similarity of “A” to “B”
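Using the counts and edge weights from this slide, the similarity formula can be sketched as below. Note that it is asymmetric: Java to JVM is 2/2, while JVM to Java is 2/3. The `recommend` helper, and in particular its choice to score a candidate tag by the maximum similarity from any tag the user already has, is an assumption made for illustration; the slides only define the pairwise similarity.

```python
# Node counts and edge weights as shown on the slide (tags lower-cased).
NODE_COUNTS = {"java": 2, "jvm": 3, "spark": 1, "sql": 1}
EDGE_WEIGHTS = {
    ("java", "jvm"): 2,
    ("jvm", "spark"): 1,
    ("java", "sql"): 1,
    ("jvm", "sql"): 1,
}


def similarity(a, b):
    """Similarity of tag a to tag b: weight(a, b) / count(a)."""
    weight = EDGE_WEIGHTS.get((a, b)) or EDGE_WEIGHTS.get((b, a)) or 0
    return weight / NODE_COUNTS[a]


def recommend(user_tags):
    """Rank tags the user does not have yet by similarity to the ones they do.

    Assumption: a candidate's score is the maximum similarity from any
    of the user's current tags.
    """
    known = {t.lower() for t in user_tags}
    scores = {}
    for a in known:
        if a not in NODE_COUNTS:
            continue  # tag never seen in the graph
        for b in NODE_COUNTS:
            if b in known:
                continue
            s = similarity(a, b)
            if s > 0:
                scores[b] = max(scores.get(b, 0.0), s)
    # Highest score first; ties broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


print(similarity("java", "jvm"))
print(recommend(["JVM", "Spark"]))
```

For user 290, who has JVM and Spark, this ranks Java ahead of sql because two of the three JVM users also answer Java questions.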
30. Data Pipeline
Historical data (60G)
Streaming data
1. Computing how long it takes each question to get an answer
2. Generating a random number based on the sampling fraction
3. Computing which types of questions each user has answered (constructing the graph)
1. Sampling data
2. Computing probability
3. Searching neighbors
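The first processing step, computing how long a question took to get an answer, can be sketched with the timestamp format seen in the `posted_at` column. The answer timestamp used here is a hypothetical value chosen to reproduce the 3010 seconds shown for question 231; the slides do not show the raw answer data, and `answer_time_sec` is an illustrative name.

```python
from datetime import datetime

# Timestamp layout used by posted_at, e.g. 2016_01_02_21_20_01.
FMT = "%Y_%m_%d_%H_%M_%S"


def answer_time_sec(posted_at, answered_at):
    """Seconds between a question being posted and being answered."""
    delta = datetime.strptime(answered_at, FMT) - datetime.strptime(posted_at, FMT)
    return int(delta.total_seconds())


# posted_at is from the slide; answered_at is a hypothetical value.
print(answer_time_sec("2016_01_02_21_20_01", "2016_01_02_22_10_11"))
```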
31. About Me
• Chentao (Sam) Zhang
• MS in Electrical & Computer Engineering from the University of Delaware
• Passionate about learning and trying new things