4. LET’S START WITH STANDARD
SQL-LIKE, TIDY DATA
What makes this data tidy?
• Observations are in rows
• Variables are in columns
• Contained in a single data set
Name   Day  Score
Allen    1     25
Joe      3     17
Joe      2     14
Mary     2     14
Mary     1     11
Allen    3      9
Mary     3      9
Joe      1      1
But can you tell me anything useful about this data set?
10. LET’S START WITH STANDARD
SQL-LIKE, TIDY DATA
Sure. These are easy to see:
• Highest score
• Lowest score
• Total observations
11. LET’S START WITH STANDARD
SQL-LIKE, TIDY DATA
Not so easy:
• How many people?
• Who’s doing the best?
• Who’s doing the worst?
• How are individuals doing?
12. HOW ABOUT NOW?
What changed?
• The data’s still tidy, but we’ve changed the organizing principle
Name   Score  Day
Allen     25    1
Mary      11    1
Joe        1    1
Mary      14    2
Joe       14    2
Joe       17    3
Allen      9    3
Mary       9    3
13. OK HOW ABOUT NOW?
(LAST TIME I PROMISE)
Name   Ordered Scores
Joe    [1, 14, 17]
Mary   [11, 14, 9]
Allen  [25, NA, 9]
This data’s NOT TIDY but...
I can eyeball it easily
And new questions become interesting and easier to answer
17. OK HOW ABOUT NOW?
(LAST TIME I PROMISE)
• How many students are there?
• Who improved?
• Who missed a test?
• Who was kind of meh?
19. DON’T GET MAD
I’m not saying to kill tidy.
But I worry we don’t use certain methods more often because it’s not as easy as it could be.
22. BEFORE I GOT INTO THE ’NOSQL’
MINDSET I SIGHED THINKING ABOUT…
• App analytics: What usage patterns do we see in our long-term app users? How do those patterns evolve over time at the individual level?
• Health research: Can we predict early on in an experiment what’s likely to happen? Do our experiments need to be as long as they are?
• Consumer research: Do people like things because they like them or because of the ordering they saw them in?
25. I ALSO TENDED NOT TO ASK NO-SQL
QUESTIONS TOO OFTEN
• Status quo bias: humans tend to take whatever default is presented. That happens in data analysis too.
• Endowment effect: humans tend to want what they already have and think it’s more valuable than what’s offered for a trade.
• Especially deep finding: humans are lazy
26. IT’S TRUE. YOU CAN ANSWER THESE
QUESTIONS WITH THE TIDY DATA FRAMES I
JUST SHOWED YOU.
You can always ’reconstruct’ these trajectories of what happened by making a data frame per user
>>> df
    Name  Day  Score
0  Allen    1     25
1    Joe    3     17
2    Joe    2     14
3   Mary    2     14
4   Mary    1     11
5  Allen    3      9
6   Mary    3      9
7    Joe    1      1
Option 1:
>>> no_sql_df = df.groupby('Name').apply(lambda g: list(g.sort_values(by='Day')['Score']))
>>> no_sql_df
Name
Allen        [25, 9]
Joe      [1, 14, 17]
Mary     [11, 14, 9]
27. IT’S TRUE. YOU CAN ANSWER THESE
QUESTIONS WITH THE TIDY DATA FRAMES I
JUST SHOWED YOU.
Option 2:
>>> new_list = []
>>> for name, group in df.groupby('Name'):
...     new_list.append({name: list(zip(group['Day'], group['Score']))})
...
>>> new_list
[{'Allen': [(1, 25), (3, 9)]}, {'Joe': [(3, 17), (2, 14), (1, 1)]}, {'Mary': [(2, 14), (1, 11), (3, 9)]}]
28. IT’S TRUE. YOU CAN ANSWER THESE
QUESTIONS WITH THE TIDY DATA FRAMES I
JUST SHOWED YOU.
Option 3:
>>> def process(new_df):
...     return [new_df[new_df['Day'] == i]['Score'].values[0]
...             if i in list(new_df['Day']) else None for i in range(1, 4)]
...
>>> df.groupby('Name').apply(process)
Name
Allen    [25, None, 9]
Joe        [1, 14, 17]
Mary       [11, 14, 9]
32. LET’S BE HONEST… NO ONE WANTS THAT
TO BE A FIRST STEP TO EVERY QUERY... AND
INCREASINGLY WE’LL BE REQUIRED TO
MAKE THESE SORTS OF QUERIES
• Google ads (well… maybe less so in Europe)
• Wearable sensors
• The unit of an observation should be the actor, not the particular action observed at a particular time.
• Maybe we should rethink what we mean by ‘observations’
33. • High scalability
• Distributed computing
• Schema flexibility
• Semi-structured data
• No complex relationships
• Schemas change all the time
• Patterns change all the time
• Same units of interest repeating new things
We don’t look for No-SQL because we have No-SQL databases... We
have No-SQL databases because we have No-SQL data.
34. WHAT IS NO-SQL PYTHON?
Data that doesn’t seem like it fits in a data frame
• Arbitrarily nested data
• Ragged data
• Comparative time series
35. WHERE DO WE FIND NO-SQL DATA?
Here’s where I’ve found it…
• Physics lab
• Running data
• Health data
• Reddit
37. GETTING THE DATA INTO PYTHON
WHEN IT’S STRAIGHTFORWARD
• Scenario: you’re grabbing a bunch of NoSQL data from an API or from a NoSQL db.
• We’ll stick with JSON since it’s a common format.
• Best-case scenario: you’ll take everything however you can get it. In this case stick with pandas. json_normalize() works great.
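As a sketch of that best case, here is `json_normalize` run on an inline payload shaped like the sample JSON used later in this talk (the field names are just the running example; `pd.json_normalize` is the current pandas spelling):

```python
import pandas as pd

# Inline payload shaped like the sample JSON used in this talk.
data = {"samples": [
    {"name": "Jane Doe", "age": 42,
     "series": [{"day": 0, "measurement_value": 0.97},
                {"day": 1, "measurement_value": 1.55}]},
    {"name": "Bob Smith", "age": 37,
     "series": [{"day": 0, "measurement_value": 1.25}]},
]}

# One row per inner measurement; the per-person fields repeat down the rows.
flat = pd.json_normalize(data["samples"], record_path="series", meta=["name", "age"])
print(flat)
```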
43. JSON_NORMALIZE WORKS PRETTY
WELL
{"samples": [
  { "name": "Jane Doe",
    "age": 42,
    "profession": "architect",
    "series": [
      { "day": 0, "measurement_value": 0.97 },
      { "day": 1, "measurement_value": 1.55 },
      { "day": 2, "measurement_value": 0.67 }
    ]},
  { "name": "Bob Smith",
    "hobbies": ["tennis", "cooking"],
    "age": 37,
    "series":
      { "day": 0, "measurement_value": 1.25 }
  }]}
>>> with open(json_file) as data_file:
...     data = json.load(data_file)
>>> normalized_data = json_normalize(data['samples'])
Easy to process
>>> print(normalized_data['series'][0][1])
{'measurement_value': 1.55, 'day': 1}
Easy to add columns
>>> normalized_data['length'] = normalized_data['series'].apply(len)
Basically, it just works
47. USING SOME PROGRAMMER STUFF ALSO HELPS
class dfList(list):
    def __init__(self, originalValue):
        if isinstance(originalValue, list):
            super().__init__(originalValue)
        else:
            super().__init__([originalValue])
    def __getitem__(self, item):
        result = list.__getitem__(self, item)
        try:
            return result[ITEM_TO_GET]
        except (TypeError, KeyError, IndexError):
            return result
    def __iter__(self):
        for i in range(list.__len__(self)):
            yield self[i]
    def __call__(self):
        return sum(self) / list.__len__(self)
• Subclass an iterable to shorten your apply() calls
• In particular, you need to override at least __getitem__ and __iter__
• You should probably override __init__ as well for the case of inconsistent format
• Then __call__ can be a catch-all adjustable function... best to load it up with a call to a class function, which you can adjust at will anytime.
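A runnable version of that idea, assuming `ITEM_TO_GET` is a module-level constant naming the nested field you usually want (both names are the slide's own placeholders, filled in here with the talk's measurement example):

```python
ITEM_TO_GET = "measurement_value"  # assumed module-level constant, as in the class above

class dfList(list):
    """List of measurement dicts: indexing reaches into each dict,
    and calling the instance returns the mean of the values."""
    def __getitem__(self, item):
        result = list.__getitem__(self, item)
        try:
            return result[ITEM_TO_GET]
        except (TypeError, KeyError, IndexError):
            return result
    def __iter__(self):
        for i in range(list.__len__(self)):
            yield self[i]
    def __call__(self):
        return sum(self) / list.__len__(self)

series = dfList([{"day": 0, "measurement_value": 1.0},
                 {"day": 1, "measurement_value": 2.0}])
print(series[1])   # 2.0 -- no per-row lambda needed
print(series())    # 1.5
```

With a column of these, `df['series'].apply(lambda s: s())` replaces a much longer per-row extraction lambda.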
48. CUSTOM CLASSES PAIR NICELY WITH
CLASS METHODS
class Test:
    def __init__(self, name):
        self.name1 = name
    def print_class_instance(instance):
        print(instance.name1)
    def print_self(self):
        self.__class__.print_class_instance(self)
>>> test1 = Test('test1')
>>> test1.print_self()
test1
>>> def new_printing(instance):
...     print("Now I'm printing a constant string")
...
>>> test1.print_self()
test1
>>> Test.print_class_instance = new_printing
>>> test1.print_self()
Now I'm printing a constant string
• Design flexible classes that often reference class methods rather than instance methods
• Then as you are processing data, you can quickly swap out methods to call different field names in the event of highly nested JSON
• Data processing is faster and no mental gymnastics or annoying parse efforts required
53. SOMETIMES YOU GET WEIRD CSV FILES…
• Sometimes your problem is as simple as getting a csv file with nested data
• This is pretty straightforward to deal with… use regex and common Python string operations to clean up the data
• apply() is your best friend
• Common problem: spaces between “,” and the column name or column value; use a parameter to avoid it: df = pd.read_csv("in.csv", sep=",", skipinitialspace=True)
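For instance, that parameter cleanly absorbs the stray spaces (inline data here purely for illustration):

```python
import io
import pandas as pd

# A CSV with stray spaces after each comma -- a common hand-edited-file problem.
raw = "name, age\njoe, 28\nmary, 36\n"

# skipinitialspace=True drops the space that follows each delimiter.
df = pd.read_csv(io.StringIO(raw), sep=",", skipinitialspace=True)
print(df.columns.tolist())
```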
57. SOMETIMES YOU GET WEIRD CSV FILES…
name,favorites,age
joe,"[madonna,elvis,u2]",28
mary,"[lady gaga, adele]",36
allen,"[beatles, u2, adele, rolling stones]"
This isn’t even that weird
>>> df = pd.read_csv(file_name, sep=",")
Downright straightforward
Hmmm….
>>> print(df['favorites'][0][1])
m
Regex to the rescue… Python’s exceptionally easy string parsing is a huge asset for No-SQL parsing
>>> df['favorites'] = df['favorites'].apply(lambda s: [x.strip() for x in s[1:-1].split(',')])
>>> print(df['favorites'][1][1])
adele
58. WHAT ABOUT THIS ONE?
name,favorites,age
joe,[madonna,elvis,u2],28
mary,[lady gaga, adele],36
allen,[beatles, u2, adele, rolling stones]
This isn’t even that weird
>>> df = pd.read_csv(file_name, sep=",")
Downright straightforward?
Actually this fails miserably: the unquoted commas inside the brackets split into extra columns
>>> print(df['favorites'])
joe [madonna elvis u2]
mary [lady gaga adele] 36
Name: name, dtype: object
We need more regex… this time before applying read_csv()....
61.
name,favorites,age
joe,[madonna,elvis,u2],28
mary,[lady gaga, adele],36
allen,[beatles, u2, adele, rolling stones]
Missing quotes around arrays:
pattern = r"(\[.*\])"
with open(file_name) as f:
    for line in f:
        new_line = line
        for m in re.finditer(pattern, line):
            replacement = '"' + m.group(1) + '"'
            new_line = new_line.replace(m.group(1), replacement)
        with open(write_file, 'a') as write_f:
            write_f.write(new_line)
new_df = pd.read_csv(write_file)
Basically, put in the quotation marks to help out read_csv()
With multiple arrays per row, you’ll need to accommodate the greedy nature of regex:
pattern = r"(\[.*?\])"
63. SOMETIMES YOU GET JSON AND YOU KNOW
THE STRUCTURE, YOU JUST DON’T LIKE IT
• Use json_normalize() and then shed columns you don’t want. You’ve seen that today already (slides 32-38).
• Use some magic: sh with jq to simplify your life… you can pick out the fields you want with jq either on the command line or with sh
• jq has a straightforward, easy-to-learn syntax: . = value, [] = array operation, etc…
cat = sh.cat
jq = sh.jq
rule = """[{name: .samples[].name, days: .samples[].series[].day}]"""
out = jq(rule, cat(_in=json_data)).stdout
json.loads(out)
64. AND SOMETIMES YOU HAVE NO IDEA
WHAT’S IN AN ENORMOUS JSON FILE
• Inconsistent or undocumented API
• Legacy Mongo database
• Someone handed you some gnarly JSON because they couldn’t parse it
65. YOU’RE A PROGRAMMER…USE ITERATORS
• The ijson module is an iterator JSON parser… you can deal with structure one bit at a time
• This also gives you a great opportunity to make data parsing decisions as you go
• This isn’t fast, but it’s also not fast to shoot from the hip when you’re talking about gnarly JSON
66. YOU’RE A PROGRAMMER…USE ITERATORS
with open(file_name, 'rb') as file:
    results = ijson.items(file, "samples.item")
    for record in results:
        for k in record.keys():
            if isinstance(record[k], dict):
                recursive_check(record[k])
            if isinstance(record[k], list):
                recursive_check(record[k])
        process(record)
67. YOU’RE A PROGRAMMER…USE ITERATORS
total_dict = defaultdict(lambda: False)
def recursive_check(d):
    if isinstance(d, dict):
        if not total_dict[tuple(sorted(d.keys()))]:
            class_name, module_name = input("Input the new class name with a "
                "space and then the file name defining the class ").split()
            mod = import_module(module_name)
            cls = getattr(mod, class_name)
            total_dict[tuple(sorted(d.keys()))] = cls
        for k in d.keys():
            new_class = recursive_check(d[k])
            if new_class:
                d[k] = new_class(**d[k])
        return total_dict[tuple(sorted(d.keys()))]
    elif isinstance(d, list):
        for i in range(len(d)):
            new_class = recursive_check(d[i])
            if new_class:
                d[i] = new_class(**d[i])
    else:
        return False
• Basically, you can build custom classes or generate appropriate named tuples as you go.
• This lets you know what you have and lets you build data structures to accommodate what you have.
• Storing these objects in a class rather than a simple dictionary again gives you the option to customize __call__() to your needs
• Remember that class methods can easily be adjusted dynamically, so it’s good to code classes with instances that reference class methods.
70. CLUSTERING TIME SERIES
• Reports of clustering and classifying time series are surprisingly rare
• Methods are computationally demanding, O(N²)… but we’re getting there
• Relatedly, ‘classification’ can also be used for series-related predictions
• Can use many commonly applied clustering algorithms once you get the distance metric
http://www.cse.ust.hk/vldb2002/VLDB2002-proceedings/slides/S12P01slides.pdf
72. WHEN DO PEOPLE GO RUNNING?
Actually, I made these plots with R…
73. NANO-SCALE PHYSICS
Meisner et al, J. Am. Chem. Soc. 2012, 134, 20440−20445
• You can build an electrical circuit which has a single molecule as its narrowest part
• It turns out it’s quite easy to distinguish different molecules depending on their trajectory as you pull on them
• In particular, their summed behavior looks quite different
• Suggests that we could cluster and identify individual measurements with reasonable certainty
76. REDDIT
• Several months of pulling the top 25 threads off Reddit’s front page shows significantly different trends for different subreddits.
• Some kinds of posts don’t last long (r/TwoX and r/videos)
• r/personalfinance shows a remarkable ability to have a second peak/second life on the front page
• r/videos do great but burn out quickly
78. QUICK: HOW IT WORKS II.
• O(N²) in theory
• Various lower bounding techniques significantly reduce processing time
• Dynamic programming problem
http://wearables.cc.gatech.edu/paper_of_week/DTW_myths.pdf
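The dynamic-programming recurrence behind DTW can be sketched in a few lines; this is a generic textbook version of the O(N²) table, without any of the lower-bounding speedups the slide mentions:

```python
def dtw(a, b):
    """Dynamic time warping distance between two numeric sequences
    (plain O(N*M) table, absolute-difference local cost)."""
    n, m = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]

ts1 = [0, 1, 2, 1, 0]
ts2 = [0, 0, 1, 2, 1, 0]   # same shape as ts1, just shifted: DTW distance is 0
ts3 = [2, 2, 2, 2, 2]      # flat line: DTW stays large
print(dtw(ts1, ts2), dtw(ts1, ts3))
```

Warping lets ts1 match the time-shifted ts2 exactly, which is precisely what Euclidean distance fails to do on the next slide's example.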
79. WHY THE FANCY METHOD?
Euclidean distance matches ts3 to ts1, despite our intuition that ts1 and ts2 are more alike.
http://www.cse.ust.hk/vldb2002/VLDB2002-proceedings/slides/S12P01slides.pdf
http://nbviewer.jupyter.org/github/alexminnaar/time-series-classification-and-clustering/blob/master/Time%20Series%20Classification%20and%20Clustering.ipynb
82. WHY?
Time series classification and related metrics can be one more thing to know… or even several more things to know
Name   Ordered Scores   Score Trajectory Type   Number of Tests   Predicted Score for Next Test
Joe    [1, 14, 17]      good                    3                 19
Mary   [11, 14, 9]      meh                     3                 11
Allen  [25, NA, 9]      underachiever           2                 35
Trajectory type comes from classification, the predicted score from prediction, and the number of tests from easy apply() calls
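The "easy apply() calls" column, for example, can be filled in one line from the nested scores (a hypothetical frame mirroring the table above, with None standing in for NA):

```python
import pandas as pd

# Per-student frame like the slide's table; None marks a missed test.
df = pd.DataFrame({
    "Name": ["Joe", "Mary", "Allen"],
    "Ordered Scores": [[1, 14, 17], [11, 14, 9], [25, None, 9]],
})

# Count only the tests actually taken.
df["Number of Tests"] = df["Ordered Scores"].apply(
    lambda scores: sum(s is not None for s in scores))
print(df[["Name", "Number of Tests"]])
```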
86. THE SHORT VERSION
• Pandas is already well-adapted to the No-SQL world
• Make your data format work for you
• Comparative time series go hand-in-hand with the increasing availability of No-SQL data. Everything is a time series if you look hard enough.
• Non-time series collections are also informative. This was just one example of what you can do.