NO-SQL PYTHON
Aileen Nielsen
Software Engineer, One Drop, NYC
aileen@onedrop.today
1
OUTLINE
1. WHY? (OTHER THAN THE TRENDY NAME)
2. HOW?
3. WHY? (AGAIN)
2
1. WHY?
3
LET’S START WITH STANDARD
SQL-LIKE, TIDY DATA
What makes this data tidy?
• Observations are in rows
• Variables are in columns
• Contained in a single data set
Name Day Score
Allen 1 25
Joe 3 17
Joe 2 14
Mary 2 14
Mary 1 11
Allen 3 9
Mary 3 9
Joe 1 1
But can you tell me
anything useful
about this data set?
9
LET’S START WITH STANDARD
SQL-LIKE, TIDY DATA
Name Day Score
Allen 1 25
Joe 3 17
Joe 2 14
Mary 2 14
Mary 1 11
Allen 3 9
Mary 3 9
Joe 1 1
Sure. These are easy
to see:
• Highest score
• Lowest score
• Total observations
10
LET’S START WITH STANDARD
SQL-LIKE, TIDY DATA
Name Day Score
Allen 1 25
Joe 3 17
Joe 2 14
Mary 2 14
Mary 1 11
Allen 3 9
Mary 3 9
Joe 1 1
Not-so-easy
• How many people?
• Who’s doing the best?
• Who’s doing the worst?
• How are individuals doing?
11
HOW ABOUT NOW?
What Changed?
• The data’s still tidy, but we’ve changed
the organizing principle
Name Score Day
Allen 25 1
Mary 11 1
Joe 1 1
Mary 14 2
Joe 14 2
Joe 17 3
Allen 9 3
Mary 9 3
12
OK HOW ABOUT NOW?
(LAST TIME I PROMISE)
This data’s NOT TIDY but...
I can eyeball it easily
And new questions become
interesting and easier to
answer
Name Ordered Scores
Joe [1, 14, 17]
Mary [11, 14, 9]
Allen [25, NA, 9]
16
OK HOW ABOUT NOW?
(LAST TIME I PROMISE)
• How many students are there?
• Who improved?
• Who missed a test?
• Who was kind of meh?
Name Ordered Scores
Joe [1, 14, 17]
Mary [11, 14, 9]
Allen [25, NA, 9]
17
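The four questions on this slide can be answered against the nested layout with short apply() calls. What follows is a minimal sketch, assuming the table is held in a pandas data frame whose "Ordered Scores" column contains Python lists, with None standing in for NA. The helper `taken` and the cut-offs used for "improved" and "meh" are illustrative assumptions, not definitions from the talk.

```python
import pandas as pd

# One row per student; the second column holds that student's ordered scores.
# None marks a missed test (the NA in the table).
nested = pd.DataFrame({
    "Name": ["Joe", "Mary", "Allen"],
    "Ordered Scores": [[1, 14, 17], [11, 14, 9], [25, None, 9]],
})
scores = nested["Ordered Scores"]

def taken(series):
    """Scores for the tests that were actually taken."""
    return [s for s in series if s is not None]

# How many students are there?
n_students = len(nested)

# Who improved? (assumption: the last taken score beats the first one)
improved = nested.loc[scores.apply(lambda s: taken(s)[-1] > taken(s)[0]), "Name"].tolist()

# Who missed a test?
missed = nested.loc[scores.apply(lambda s: None in s), "Name"].tolist()

# Who was kind of meh? (assumption: scores never spread more than 5 points)
meh = nested.loc[scores.apply(lambda s: max(taken(s)) - min(taken(s)) <= 5), "Name"].tolist()

print(n_students, improved, missed, meh)
```

Each question is one line because the unit of observation is already the student rather than the individual test.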
DON’T GET MAD
I’m not saying to kill tidy
But I worry we don’t use certain methods
more often because it’s not as easy as it
could be.
Name Ordered Scores
Joe [1, 14, 17]
Mary [11, 14, 9]
Allen [25, NA, 9]
19
BEFORE I GOT INTO THE ’NOSQL’
MINDSET I SIGHED THINKING ABOUT…
• App analytics: What usage patterns do we see in our
long-term app users? How do those patterns evolve over
time at the individual level?
• Health research: Can we predict early on in an
experiment what’s likely to happen? Do our experiments
need to be as long as they are?
• Consumer research: Do people like things because they
like them or because of the order they saw them in?
22
I ALSO TENDED NOT TO ASK NO-SQL
QUESTIONS TOO OFTEN
• Status quo bias: humans tend to take whatever default is
presented. That happens in data analysis too.
• Endowment effect: humans tend to want what they already
have and think it’s more valuable than what’s offered for a trade.
• Especially deep finding: humans are lazy
25
IT’S TRUE. YOU CAN ANSWER THESE
QUESTIONS WITH THE TIDY DATA FRAMES I
JUST SHOWED YOU.
You can always ’reconstruct’ these trajectories of
what happened by making a data frame per user
>>> df
    Name  Day  Score
0  Allen    1     25
1    Joe    3     17
2    Joe    2     14
3   Mary    2     14
4   Mary    1     11
5  Allen    3      9
6   Mary    3      9
7    Joe    1      1
Option 1:
>>> no_sql_df = df.groupby('Name').apply(lambda df: list(df.sort_values(by='Day')['Score']))
>>> no_sql_df
Name
Allen        [25, 9]
Joe      [1, 14, 17]
Mary     [11, 14, 9]
26
Option 2:
>>> new_list = []
>>> for name, group in df.groupby('Name'):
...     # list(...) is needed on Python 3, where zip() returns a lazy iterator
...     new_list.append({name: list(zip(group['Day'].tolist(), group['Score'].tolist()))})
...
>>> new_list
[{'Allen': [(1, 25), (3, 9)]}, {'Joe': [(3, 17), (2, 14), (1, 1)]}, {'Mary': [(2, 14), (1, 11), (3, 9)]}]
27
Option 3:
>>> def process(new_df):
...     # one slot per day 1..3; None where the user has no row for that day
...     return [new_df[new_df['Day'] == i]['Score'].values[0]
...             if i in list(new_df['Day']) else None
...             for i in range(1, 4)]
...
>>> df.groupby('Name').apply(process)
Name
Allen    [25, None, 9]
Joe        [1, 14, 17]
Mary       [11, 14, 9]
28
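A shorter route to the same trajectories is sketched here, under the assumption that each (Name, Day) pair appears at most once. pivot() lays the days out as columns and leaves NaN where a user has no row for a day; each row is then collapsed into an ordered list. The names `wide` and `trajectories` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Allen", "Joe", "Joe", "Mary", "Mary", "Allen", "Mary", "Joe"],
    "Day": [1, 3, 2, 2, 1, 3, 3, 1],
    "Score": [25, 17, 14, 14, 11, 9, 9, 1],
})

# Rows become names, columns become days; missing (Name, Day) pairs are NaN.
wide = df.pivot(index="Name", columns="Day", values="Score")

# Collapse each row into one ordered list, using None for a missed day.
trajectories = {
    name: [None if pd.isna(v) else int(v) for v in row]
    for name, row in wide.iterrows()
}
print(trajectories)
```

The intermediate `wide` frame is still rectangular, so it is a convenient place to stop if a downstream tool needs fixed-width input.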
LET’S BE HONEST…NO ONE WANTS THAT
TO BE A FIRST STEP TO EVERY QUERY...AND
INCREASINGLY WE’LL BE REQUIRED TO
MAKE THESE SORTS OF QUERIES
• Google ads (well…maybe less so in Europe)
• Wearable sensors
• The unit of an observation should be the actor, not the particular
action observed at a particular time.
• Maybe we should rethink what we mean by ‘observations’
32
• High scalability
• Distributed computing
• Schema flexibility
• Semi-structured data
• No complex relationships
• Schema change all the time
• Patterns change all the time
• Same units of interest
repeating new things
33
We don’t look for No-SQL because we have No-SQL databases... We
have No-SQL databases because we have No-SQL data.
WHAT IS NO-SQL PYTHON?
Data that doesn’t seem like it fits in a data frame
• Arbitrarily nested data
• Ragged data
• Comparative time series
34
WHERE DO WE FIND NO-SQL DATA?
Here’s where I’ve found it…
• Physics lab
• Running data
• Health data
• Reddit
35
2. HOW?
36
GETTING THE DATA INTO PYTHON
WHEN IT’S STRAIGHTFORWARD
• Scenario: you’re grabbing a bunch of NoSQL data from an
API or from a NoSQL db.
• We’ll stick with JSON since it’s a common format.
• Best-case scenario: you’ll take everything however you can
get it. In this case stick with pandas. json_normalize()
works great.
37
JSON_NORMALIZE WORKS PRETTY
WELL
43
{"samples": [
  { "name": "Jane Doe",
    "age": 42,
    "profession": "architect",
    "series": [
      {
        "day": 0,
        "measurement_value": 0.97
      },
      {
        "day": 1,
        "measurement_value": 1.55
      },
      {
        "day": 2,
        "measurement_value": 0.67
      }
    ]},
  { "name": "Bob Smith",
    "hobbies": ["tennis", "cooking"],
    "age": 37,
    "series":
      {
        "day": 0,
        "measurement_value": 1.25
      } }]}
Easy to process
>> with open(json_file) as data_file:
>>     data = json.load(data_file)
>> normalized_data = pd.json_normalize(data['samples'])
>> print(normalized_data['series'][0][1])
>> {'day': 1, 'measurement_value': 1.55}
Basically, it just works
Easy to add columns
>> # Bob's series is a single record rather than a list, so count it as 1
>> normalized_data['length'] = normalized_data['series'].apply(lambda s: len(s) if isinstance(s, list) else 1)
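To make the two shapes json_normalize() can hand back concrete, here is a small self-contained sketch. It assumes pandas 1.0 or later (where the function is exposed as pd.json_normalize) and, unlike the slide, assumes every "series" value is a list, so Bob's single record is wrapped in one. The names `per_person` and `per_measurement` are illustrative.

```python
import pandas as pd

data = {"samples": [
    {"name": "Jane Doe", "age": 42,
     "series": [{"day": 0, "measurement_value": 0.97},
                {"day": 1, "measurement_value": 1.55},
                {"day": 2, "measurement_value": 0.67}]},
    {"name": "Bob Smith", "age": 37,
     "series": [{"day": 0, "measurement_value": 1.25}]},
]}

# One row per person: the nested list stays inside the "series" column.
per_person = pd.json_normalize(data["samples"])
per_person["length"] = per_person["series"].apply(len)

# One row per measurement: record_path unpacks the list and meta repeats
# the person-level fields next to each record.
per_measurement = pd.json_normalize(
    data["samples"], record_path="series", meta=["name", "age"]
)

print(per_person[["name", "length"]])
print(per_measurement)
```

The first frame keeps the actor as the unit of observation; the second is the tidy layout, available whenever a tool needs it.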
USING SOME PROGRAMMER STUFF ALSO HELPS
47
class dfList(list):
    def __init__(self, originalValue):
        # Fill the list itself; rebinding `self` would leave it empty
        if isinstance(originalValue, list):
            list.__init__(self, originalValue)
        else:
            list.__init__(self, list(originalValue))
    def __getitem__(self, item):
        result = list.__getitem__(self, item)
        try:
            # ITEM_TO_GET is a module-level key or index defined elsewhere
            return result[ITEM_TO_GET]
        except Exception:
            return result
    def __iter__(self):
        for i in range(len(self)):
            yield self.__getitem__(i)
    def __call__(self):
        # Catch-all: here, the mean of the extracted items
        return sum(self) / list.__len__(self)
• Subclass an iterable to shorten your
apply() calls
• In particular, you need to override at
least __getitem__ and __iter__
• You should probably override __init__
as well for the case of inconsistent
format
• Then __call__ can be a catch-all
adjustable function... best to load it up
with a call to a class function, which
you can adjust at will anytime.
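The bullets above can be sketched end to end as follows. This is an illustrative variant rather than the class from the slide: the name `FieldList`, its `field` class attribute, and the choice to wrap a lone dict in a one-element list are all assumptions made for the example.

```python
import pandas as pd

class FieldList(list):
    """List of dicts that hands back one chosen field per element."""
    field = "measurement_value"  # hypothetical default; reassign per data set

    def __init__(self, value):
        # A lone dict (inconsistent format) is wrapped so it acts like a list.
        super().__init__(value if isinstance(value, list) else [value])

    def __getitem__(self, item):
        result = super().__getitem__(item)
        try:
            return result[self.field]
        except (KeyError, TypeError, IndexError):
            return result

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]

    def __call__(self):
        # Catch-all summary: here, the mean of the chosen field.
        return sum(self) / len(self)

raw = pd.Series([
    [{"day": 0, "measurement_value": 0.97},
     {"day": 1, "measurement_value": 1.55},
     {"day": 2, "measurement_value": 0.67}],
    {"day": 0, "measurement_value": 1.25},   # a bare dict, not a list
])

wrapped = raw.apply(FieldList)
means = wrapped.apply(lambda fl: fl())
print(means.tolist())
```

Because `field` lives on the class, setting `FieldList.field = "day"` changes what every already-wrapped cell returns, without touching the data frame.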
CUSTOM CLASSES PAIR NICELY WITH
CLASS METHODS
48
class Test:
    def __init__(self, name):
        self.name1 = name
    def print_class_instance(instance):
        print(instance.name1)
    def print_self(self):
        # look the function up on the class so it can be swapped at runtime
        self.__class__.print_class_instance(self)
>>> test1 = Test('test1')
>>> test1.print_self()
test1
>>> def new_printing(instance):
...     print("Now I'm printing a constant string")
...
>>> test1.print_self()
test1
>>> Test.print_class_instance = new_printing
>>> test1.print_self()
Now I'm printing a constant string
• Design flexible classes that often
reference class methods rather
than instance methods
• Then as you are processing data,
you can quickly swap out methods
to call different field names in the
event of highly nested JSON
• Data processing is faster and no
mental gymnastics or annoying
parse efforts required
GETTING NOSQL DATA: COMMONLY-
ENCOUNTERED PROBLEMS
• CSVs with arrays
• Highly-nested JSON
• Unknown or unreliably formatted API results
49
SOMETIMES YOU GET WEIRD CSV FILES…
• Sometimes your problem is as simple as getting a csv file with nested data
• This is pretty straightforward to deal with…use regex and common Python string
operations to clean up the data
• apply() is your best friend
• Common problem: spaces between “,” and a column name or column value. Use the
skipinitialspace parameter to avoid this:
df = pd.read_csv("in.csv", sep=",", skipinitialspace=True)
53
SOMETIMES YOU GET WEIRD CSV FILES…
57
name,favorites,age
joe,"[madonna,elvis,u2]",28
mary,"[lady gaga, adele]",36
allen,"[beatles, u2, adele, rolling stones]"
This isn’t even that weird
>> df = pd.read_csv(file_name, sep =",")
Downright straightforward
Hmmm….
>> print(df['favorites'][0][1])
>> m
Regex to the rescue…Python’s exceptionally easy string parsing is a huge asset for No-SQL parsing
>> df['favorites'] = df['favorites'].apply(lambda s: [x.strip() for x in s[1:-1].split(',')])
>> print(df['favorites'][0][1])
>> elvis
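The clean-up step can be sketched as a self-contained script. The CSV text is the one from the slide, read from a string instead of a file; the helper name `parse_array` and the extra `n_favorites` column are illustrative additions.

```python
import io
import pandas as pd

csv_text = '''name,favorites,age
joe,"[madonna,elvis,u2]",28
mary,"[lady gaga, adele]",36
allen,"[beatles, u2, adele, rolling stones]"
'''

def parse_array(cell):
    """Turn the string '[a, b, c]' into the list ['a', 'b', 'c']."""
    return [item.strip() for item in cell.strip("[]").split(",")]

df = pd.read_csv(io.StringIO(csv_text))

# Before parsing, each cell is one long string, so [0][1] is a character.
first_char = df["favorites"][0][1]

df["favorites"] = df["favorites"].apply(parse_array)
df["n_favorites"] = df["favorites"].apply(len)

print(first_char, df["favorites"][0][1], df["n_favorites"].tolist())
```

Stripping each item matters here because some rows put a space after the comma and others do not.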
WHAT ABOUT THIS ONE?
58
name,favorites,age
joe,[madonna,elvis,u2],28
mary,[lady gaga, adele],36
allen,[beatles, u2, adele, rolling stones]
This isn’t even that weird
>> df = pd.read_csv(file_name, sep =",")
Downright straightforward?
Actually this fails miserably
>> print(df['favorites'])
>> joe [madonna elvis u2]
mary [lady gaga adele] 36
Name: name, dtype: object
We need more regex…this time before applying read_csv()....
61
name,favorites,age
joe,[madonna,elvis,u2],28
mary,[lady gaga, adele],36
allen,[beatles, u2, adele, rolling stones]
Missing quotes around arrays:
import re
pattern = r"(\[.*\])"
with open(file_name) as f, open(write_file, 'w') as write_f:
    for line in f:
        # wrap the bracketed array in double quotes so read_csv sees one field
        write_f.write(re.sub(pattern, r'"\1"', line))
new_df = pd.read_csv(write_file)
Basically, put in the quotation marks to help out read_csv()
With multiple arrays per row, you’re gonna need to accommodate the greedy nature of regex
pattern = r"(\[.*?\])"
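The difference between the greedy and non-greedy patterns is easiest to see on a row with two arrays. The sketch below works on in-memory strings rather than files; the two-array sample row and the helper name `quote_arrays` are made up for illustration.

```python
import io
import re
import pandas as pd

GREEDY = re.compile(r"\[.*\]")    # runs to the LAST closing bracket on the line
LAZY = re.compile(r"\[.*?\]")     # stops at the FIRST closing bracket

def quote_arrays(line, pattern=LAZY):
    """Wrap every bracketed array in double quotes."""
    return pattern.sub(lambda m: '"' + m.group(0) + '"', line)

row = "a,[1,2],[3,4],b"
greedy_result = quote_arrays(row, GREEDY)   # one big quoted field
lazy_result = quote_arrays(row, LAZY)       # each array quoted separately

raw = """name,favorites,age
joe,[madonna,elvis,u2],28
mary,[lady gaga, adele],36
allen,[beatles, u2, adele, rolling stones]
"""
fixed = "".join(quote_arrays(line) for line in raw.splitlines(keepends=True))
df = pd.read_csv(io.StringIO(fixed))

print(greedy_result)
print(lazy_result)
print(df)
```

With one array per row both patterns behave the same; the lazy form only matters once a line holds several.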
62
THAT WAS A LOT OF TEXT…
ALMOST DONE
SOMETIMES YOU GET JSON AND YOU KNOW
THE STRUCTURE,YOU JUST DON’T LIKE IT
• Use json_normalize() and then shed columns you don’t want. You’ve seen that
today already (slides 32-38).
• Use some magic: the sh module with jq to simplify your life…you can pick out the
fields you want with jq either on the command line or with sh
• jq has a straightforward, easy-to-learn syntax: . = value, [] = array operation, etc…
63
cat = sh.cat
jq = sh.jq
rule = """[{name: .samples[].name, days: .samples[].series[].day}]"""
out = jq(rule, cat(_in=json_data)).stdout
json.loads(out)
AND SOMETIMES YOU HAVE NO IDEA
WHAT’S IN AN ENORMOUS JSON FILE
• Inconsistent or undocumented API
• Legacy Mongo database
• Someone handed you some gnarly JSON because they
couldn’t parse it
64
YOU’RE A PROGRAMMER…USE ITERATORS
• The ijson module is an iterator JSON parser…you can
deal with structure one bit at a time
• This also gives you a great opportunity to make data
parsing decisions as you go
• This isn’t fast, but it’s also not fast to shoot from the hip
when you’re talking about gnarly JSON
65
YOU’RE A PROGRAMMER…USE ITERATORS
66
with open(file_name, 'rb') as file:
    # stream one element of the top-level "samples" array at a time
    results = ijson.items(file, "samples.item")
    for record in results:
        for k in record.keys():
            if isinstance(record[k], (dict, list)):
                recursive_check(record[k])
        process(record)
YOU’RE A PROGRAMMER…USE ITERATORS
67
from collections import defaultdict
from importlib import import_module

total_dict = defaultdict(lambda: False)
def recursive_check(d):
    if isinstance(d, dict):
        if not total_dict[tuple(sorted(d.keys()))]:
            class_name = input("Input the new classname with a space and then the file name defining the class ")
            mod = import_module(class_name)
            cls = getattr(mod, class_name)
            total_dict[tuple(sorted(d.keys()))] = cls
        for k in d.keys():
            # recurse into the value; a truthy result is the class to wrap it in
            new_class = recursive_check(d[k])
            if new_class:
                d[k] = new_class(**d[k])
        return total_dict[tuple(sorted(d.keys()))]
    elif isinstance(d, list):
        for i in range(len(d)):
            new_class = recursive_check(d[i])
            if new_class:
                d[i] = new_class(**d[i])
    else:
        return False
• Basically, you can build custom classes
or generate appropriate named tuples
as you go.
• This lets you know what you have and
lets you build data structures to
accommodate what you have.
• Storing these objects in a class rather
than a simple dictionary again gives you
the option to customize .__call__()
to your needs
YOU’RE A PROGRAMMER…USE ITERATORS
68
• Again, remember that class methods can easily
be adjusted dynamically, so it’s good to code
classes with instances that reference class
methods.
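A lighter-weight way to "know what you have" before writing any classes is to count the distinct key sets at each position in the document. The sketch below uses only the standard library and assumes the JSON fits in memory, so it uses json.loads rather than ijson. The function name `survey` and the path notation ("$" for the root, "[]" for list items) are conventions invented for this example.

```python
import json
from collections import Counter

def survey(node, path="$", seen=None):
    """Count each distinct set of dict keys, grouped by where it occurs."""
    if seen is None:
        seen = Counter()
    if isinstance(node, dict):
        seen[(path, tuple(sorted(node)))] += 1
        for key, value in node.items():
            survey(value, f"{path}.{key}", seen)
    elif isinstance(node, list):
        for item in node:
            survey(item, path + "[]", seen)
    return seen

doc = json.loads("""
{"samples": [
  {"name": "Jane Doe", "age": 42,
   "series": [{"day": 0, "measurement_value": 0.97},
              {"day": 1, "measurement_value": 1.55}]},
  {"name": "Bob Smith", "age": 37, "hobbies": ["tennis"],
   "series": {"day": 0, "measurement_value": 1.25}}
]}
""")

shapes = survey(doc)
for (path, keys), count in sorted(shapes.items()):
    print(count, path, keys)
```

In the printed survey, "series" shows up both as a list of records and as a bare record, which is exactly the kind of inconsistency worth knowing about before choosing a class per key set.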
3. WHY?
(AGAIN)
69
CLUSTERING TIME SERIES
• Reports of clustering and classifying time
series are surprisingly rare
• Methods are computationally demanding,
O(N²)… but we’re getting there
• Relatedly, ‘classification’ can also be used for
series-related predictions
• Can use many commonly applied clustering
algorithms once you get the distance metric
70
http://www.cse.ust.hk/vldb2002/VLDB2002-proceedings/slides/S12P01slides.pdf
WHEN DO PEOPLE GO RUNNING?
71
WHEN DO PEOPLE GO RUNNING?
72
Actually, I made these plots with R…
NANO-SCALE PHYSICS
73
Meisner et al, J. Am. Chem. Soc. 2012, 134, 20440−20445
• You can build an electrical circuit which has
a single molecule as its narrowest part
• It turns out it’s quite easy to distinguish
different molecules depending on their
trajectory as you pull on them
• Particularly their summed behavior looks
quite different
• Suggests that we could cluster and identify
individual measurements with reasonable
certainty
74
• Several months of pulling the top 25 threads off Reddit’s
front page shows significantly different trends for
different subreddits.
• Some kinds of posts don’t last long (r/TwoX and r/videos)
• r/personalfinance shows a remarkable ability to have a
second peak/second life on the front page
• r/videos do great but burn out quickly
REDDIT
QUICK: HOW IT WORKS I.
77
QUICK: HOW IT WORKS II.
78
• O(N²) in theory
• Various lower bounding techniques
significantly reduce processing time
• Dynamic programming problem
http://wearables.cc.gatech.edu/paper_of_week/DTW_myths.pdf
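The dynamic-programming formulation can be sketched in a few lines of plain Python. This is the textbook dynamic time warping (DTW) recurrence with squared point-wise cost and no lower-bounding or windowing speed-ups. The three short series are hypothetical, chosen so that Euclidean distance and DTW disagree about which pair is more alike.

```python
import math

def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    Fills a (len(a)+1) x (len(b)+1) table, so time and memory are O(N*M).
    """
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # extend the cheapest of the three admissible predecessor cells
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return math.sqrt(cost[n][m])

def euclidean(a, b):
    """Point-by-point distance; requires equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

ts1 = [0, 1, 2, 1, 0, 0, 0, 0]   # a bump early on
ts2 = [0, 0, 0, 0, 1, 2, 1, 0]   # the same bump, shifted later
ts3 = [0, 0, 0, 0, 0, 0, 0, 0]   # no bump at all

print("euclidean:", euclidean(ts1, ts2), euclidean(ts1, ts3))
print("dtw:      ", dtw_distance(ts1, ts2), dtw_distance(ts1, ts3))
```

Euclidean distance rates the flat series as closer to ts1 than the shifted bump is, while DTW warps the time axis and finds ts1 and ts2 identical in shape.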
WHY THE FANCY METHOD?
79
Euclidean distance matches ts3
to ts1, despite our intuition that
ts1 and ts2 are more alike.
http://www.cse.ust.hk/vldb2002/VLDB2002-proceedings/slides/S12P01slides.pdf
http://nbviewer.jupyter.org/github/alexminnaar/time-series-classification-and-clustering/blob/master/Time%20Series%20Classification%20and%20Clustering.ipynb
BIKE-SHARING STANDS
80
http://ofdataandscience.blogspot.nl/2013/03/capital-bikeshare-time-series-clustering.html?m=1
FUTURE RESEARCH POSSIBILITIES
81
http://wearables.cc.gatech.edu/paper_of_week/DTW_myths.pdf
WHY?
Time series classification and related metrics can be one more thing to
know…or even several more things to know
82
Name   Ordered Scores  Score Trajectory Type  Number of Tests  Predicted Score For Next Test
Joe    [1, 14, 17]     good                   3                19
Mary   [11, 14, 9]     meh                    3                11
Allen  [25, NA, 9]     underachiever          2                35
Info from classification
Info from prediction
Info from easy apply() calls
THE SHORT VERSION
• Pandas is already well-adapted to the No-SQL world
• Make your data format work for you
• Comparative time series go hand-in-hand with the increasing
availability of No-SQL data. Everything is a time series if you
look hard enough.
• Non-time series collections are also informative. This was just
one example of what you can do.
86
THANK YOU
87

More Related Content

Similar to Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

Executive Function: Effective Strategies and Interventions
Executive Function:  Effective Strategies and InterventionsExecutive Function:  Effective Strategies and Interventions
Executive Function: Effective Strategies and Interventions
David Nowell
 
Surveys that work: training course for Rosenfeld Media, day 3
Surveys that work: training course for Rosenfeld Media, day 3 Surveys that work: training course for Rosenfeld Media, day 3
Surveys that work: training course for Rosenfeld Media, day 3
Caroline Jarrett
 
Introduction to Data Mining (Why Mine Data? Commercial Viewpoint)
Introduction to Data Mining (Why Mine Data? Commercial Viewpoint)Introduction to Data Mining (Why Mine Data? Commercial Viewpoint)
Introduction to Data Mining (Why Mine Data? Commercial Viewpoint)
dradilkhan87
 
10 tough decisions donor data migration decisions (Webinar hosted by Bloomera...
10 tough decisions donor data migration decisions (Webinar hosted by Bloomera...10 tough decisions donor data migration decisions (Webinar hosted by Bloomera...
10 tough decisions donor data migration decisions (Webinar hosted by Bloomera...
Brandon Fix
 
Advanced Analytics for Salesforce
Advanced Analytics for SalesforceAdvanced Analytics for Salesforce
Advanced Analytics for Salesforce
Looker
 
Data Visualization Workflow
Data Visualization WorkflowData Visualization Workflow
Data Visualization Workflow
jeremycadams
 
Data Visualization Workflow
Data Visualization WorkflowData Visualization Workflow
Data Visualization Workflow
jeremycadams
 
QuestionPro Integrates with TryMyUI to Launch the Survey Respondent Score
QuestionPro Integrates with TryMyUI to Launch the Survey Respondent ScoreQuestionPro Integrates with TryMyUI to Launch the Survey Respondent Score
QuestionPro Integrates with TryMyUI to Launch the Survey Respondent Score
James Wirth
 
Unit1 ed572seminar
Unit1 ed572seminarUnit1 ed572seminar
Unit1 ed572seminar
drbrizuelakaplan
 
[INSIGHT OUT 2011] A21 why why is probably the right answer(tom kyte)
[INSIGHT OUT 2011] A21 why why is probably the right answer(tom kyte)[INSIGHT OUT 2011] A21 why why is probably the right answer(tom kyte)
[INSIGHT OUT 2011] A21 why why is probably the right answer(tom kyte)
Insight Technology, Inc.
 
Module 1.3 data exploratory
Module 1.3  data exploratoryModule 1.3  data exploratory
Module 1.3 data exploratory
Sara Hooker
 
Surveys that work:training course for Rosenfeld Media, day 1
Surveys that work:training course for Rosenfeld Media, day 1Surveys that work:training course for Rosenfeld Media, day 1
Surveys that work:training course for Rosenfeld Media, day 1
Caroline Jarrett
 
UNIT 2: Part 1: Data Warehousing and Data Mining
UNIT 2: Part 1: Data Warehousing and Data MiningUNIT 2: Part 1: Data Warehousing and Data Mining
UNIT 2: Part 1: Data Warehousing and Data Mining
Nandakumar P
 
computer educatio_database theory.pptx
computer educatio_database theory.pptxcomputer educatio_database theory.pptx
computer educatio_database theory.pptx
CecillePicasoMore
 
Effective Business Communication with Precision Questioning and Answering
Effective Business Communication with Precision Questioning and AnsweringEffective Business Communication with Precision Questioning and Answering
Effective Business Communication with Precision Questioning and Answering
Society of Women Engineers
 
Building Applications with a Graph Database
Building Applications with a Graph DatabaseBuilding Applications with a Graph Database
Building Applications with a Graph Database
Tobias Lindaaker
 
Sorting algos
Sorting algosSorting algos
Sorting algos
Omair Imtiaz Ansari
 
Normalisation
NormalisationNormalisation
Normalisation
Wira Galacticos
 
introDM.ppt
introDM.pptintroDM.ppt
introDM.ppt
Arumugam Prakash
 
introDMintroDMintroDMintroDMintroDMintroDM.ppt
introDMintroDMintroDMintroDMintroDMintroDM.pptintroDMintroDMintroDMintroDMintroDMintroDM.ppt
introDMintroDMintroDMintroDMintroDMintroDM.ppt
DEEPAK948083
 

Aileen Nielsen - NoSQL Python: making data frames work for you in a non-rectangular world

  • 1. NO-SQL PYTHON Aileen Nielsen Software Engineer, One Drop, NYC aileen@onedrop.today 1
  • 2. OUTLINE 1. WHY? (OTHER THAN THE TRENDY NAME) 2. HOW? 3. WHY? (AGAIN) 2
  • 4. LET’S START WITH STANDARD SQL-LIKE, TIDY DATA Name Day Score Allen 1 25 Joe 3 17 Joe 2 14 Mary 2 14 Mary 1 11 Allen 3 9 Mary 3 9 Joe 1 1 4
  • 5. LET’S START WITH STANDARD SQL-LIKE, TIDY DATA What makes this data tidy? Name Day Score Allen 1 25 Joe 3 17 Joe 2 14 Mary 2 14 Mary 1 11 Allen 3 9 Mary 3 9 Joe 1 1 5
  • 6. LET’S START WITH STANDARD SQL-LIKE, TIDY DATA What makes this data tidy? • Observations are in rows Name Day Score Allen 1 25 Joe 3 17 Joe 2 14 Mary 2 14 Mary 1 11 Allen 3 9 Mary 3 9 Joe 1 1 6
  • 7. LET’S START WITH STANDARD SQL-LIKE, TIDY DATA What makes this data tidy? • Observations are in rows • Variables are in columns Name Day Score Allen 1 25 Joe 3 17 Joe 2 14 Mary 2 14 Mary 1 11 Allen 3 9 Mary 3 9 Joe 1 1 7
  • 8. LET’S START WITH STANDARD SQL-LIKE, TIDY DATA What makes this data tidy? • Observations are in rows • Variables are in columns • Contained in a single data set Name Day Score Allen 1 25 Joe 3 17 Joe 2 14 Mary 2 14 Mary 1 11 Allen 3 9 Mary 3 9 Joe 1 1 8
  • 9. LET’S START WITH STANDARD SQL-LIKE, TIDY DATA What makes this data tidy? • Observations are in rows • Variables are in columns • Contained in a single data set Name Day Score Allen 1 25 Joe 3 17 Joe 2 14 Mary 2 14 Mary 1 11 Allen 3 9 Mary 3 9 Joe 1 1 But can you tell me anything useful about this data set? 9
  • 10. LET’S START WITH STANDARD SQL-LIKE, TIDY DATA Name Day Score Allen 1 25 Joe 3 17 Joe 2 14 Mary 2 14 Mary 1 11 Allen 3 9 Mary 3 9 Joe 1 1 Sure. These are easy to see: • Highest score • Lowest score • Total observations 10
  • 11. LET’S START WITH STANDARD SQL-LIKE, TIDY DATA Name Day Score Allen 1 25 Joe 3 17 Joe 2 14 Mary 2 14 Mary 1 11 Allen 3 9 Mary 3 9 Joe 1 1 Not-so-easy • How many people? • Who’s doing the best? • Who’s doing the worst? • How are individuals doing? 11
  • 12. HOW ABOUT NOW? What Changed? • The data’s still tidy, but we’ve changed the organizing principle Name Score Day Allen 25 1 Mary 11 1 Joe 1 1 Mary 14 2 Joe 14 2 Joe 17 3 Allen 9 3 Mary 9 3 12
  • 13. OK HOW ABOUT NOW? (LAST TIME I PROMISE) Name Ordered Scores Joe [1, 14, 17] Mary [11, 14, 9] Allen [25, NA, 9] 13
  • 14. OK HOW ABOUT NOW? (LAST TIME I PROMISE) Name Ordered Scores Joe [1, 14, 17] Mary [11, 14, 9] Allen [25, NA, 9] This data’s NOT TIDY but... 14
  • 15. OK HOW ABOUT NOW? (LAST TIME I PROMISE) This data’s NOT TIDY but... I can eyeball it easily Name Ordered Scores Joe [1, 14, 17] Mary [11, 14, 9] Allen [25, NA, 9] 15
  • 16. OK HOW ABOUT NOW? (LAST TIME I PROMISE) This data’s NOT TIDY but... I can eyeball it easily And new questions become interesting and easier to answer Name Ordered Scores Joe [1, 14, 17] Mary [11, 14, 9] Allen [25, NA, 9] 16
  • 17. OK HOW ABOUT NOW? (LAST TIME I PROMISE) • How many students are there? • Who improved? • Who missed a test? • Who was kind of meh? Name Ordered Scores Joe [1, 14, 17] Mary [11, 14, 9] Allen [25, NA, 9] 17
  • 18. DON’T GET MAD I’m not saying to kill tidy Name Ordered Scores Joe [1, 14, 17] Mary [11, 14, 9] Allen [25, NA, 9] 18
  • 19. DON’T GET MAD I’m not saying to kill tidy But I worry we don’t use certain methods more often because it’s not as easy as it could be. Name Ordered Scores Joe [1, 14, 17] Mary [11, 14, 9] Allen [25, NA, 9] 19
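The questions from slide 17 can be sketched directly against the per-student ordered scores. This is a minimal illustration in plain Python, not code from the talk: the `scores` dict mirrors the slide's table, `None` stands in for the slide's NA, and the definition of "improved" (last recorded score higher than the first) is an assumption made for the example.

```python
# Ordered scores per student, as on the slide; None stands in for NA.
scores = {
    "Joe": [1, 14, 17],
    "Mary": [11, 14, 9],
    "Allen": [25, None, 9],
}


def missed_a_test(series):
    """True if any day has no recorded score."""
    return any(s is None for s in series)


def improved(series):
    """True if the last recorded score beats the first recorded one.

    This definition of 'improved' is an assumption for illustration.
    """
    recorded = [s for s in series if s is not None]
    return len(recorded) >= 2 and recorded[-1] > recorded[0]


n_students = len(scores)
who_improved = [name for name, s in scores.items() if improved(s)]
who_missed = [name for name, s in scores.items() if missed_a_test(s)]
```

Because each student is one entry, counting students is a `len()` call and the other questions are one-line filters over each trajectory.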
  • 20. BEFORE I GOT INTO THE ‘NOSQL’ MINDSET I SIGHED WHEN ASKED QUESTIONS LIKE… • App analytics: What usage patterns do we see in our long-term app users? How do those patterns evolve over time at the individual level? 20
  • 21. BEFORE I GOT INTO THE ‘NOSQL’ MINDSET I SIGHED WHEN ASKED QUESTIONS LIKE THESE • App analytics: What usage patterns do we see in our long-term app users? How do those patterns evolve over time at the individual level? • Health research: Can we predict early on in an experiment what’s likely to happen? Do our experiments need to be as long as they are? 21
  • 22. BEFORE I GOT INTO THE ‘NOSQL’ MINDSET I SIGHED THINKING ABOUT… • App analytics: What usage patterns do we see in our long-term app users? How do those patterns evolve over time at the individual level? • Health research: Can we predict early on in an experiment what’s likely to happen? Do our experiments need to be as long as they are? • Consumer research: Do people like things because they like them or because of the ordering they saw them in? 22
  • 23. I ALSO TENDED NOT TO ASK NO-SQL QUESTIONS TOO OFTEN • Status quo bias: humans tend to take whatever default is presented. That happens in data analysis too. 23
  • 24. I ALSO TENDED NOT TO ASK NO-SQL QUESTIONS TOO OFTEN • Status quo bias: humans tend to take whatever default is presented. That happens in data analysis too. • Endowment effect: humans tend to want what they already have and think it’s more valuable than what’s offered for a trade. 24
  • 25. I ALSO TENDED NOT TO ASK NO-SQL QUESTIONS TOO OFTEN • Status quo bias: humans tend to take whatever default is presented. That happens in data analysis too. • Endowment effect: humans tend to want what they already have and think it’s more valuable than what’s offered for a trade. • Especially deep finding: humans are lazy 25
  • 26. IT’S TRUE. YOU CAN ANSWER THESE QUESTIONS WITH THE TIDY DATA FRAMES I JUST SHOWED YOU. You can always ‘reconstruct’ these trajectories of what happened by making a data frame per user
    >>> df
        Name  Day  Score
    0  Allen    1     25
    1    Joe    3     17
    2    Joe    2     14
    3   Mary    2     14
    4   Mary    1     11
    5  Allen    3      9
    6   Mary    3      9
    7    Joe    1      1
    Option 1:
    >>> no_sql_df = df.groupby('Name').apply(lambda df: list(df.sort_values(by='Day')['Score']))
    >>> no_sql_df
    Name
    Allen         [25, 9]
    Joe       [1, 14, 17]
    Mary      [11, 14, 9]
  • 27. IT’S TRUE. YOU CAN ANSWER THESE QUESTIONS WITH THE TIDY DATA FRAMES I JUST SHOWED YOU. You can always ‘reconstruct’ these trajectories of what happened by making a data frame per user
    >>> df
        Name  Day  Score
    0  Allen    1     25
    1    Joe    3     17
    2    Joe    2     14
    3   Mary    2     14
    4   Mary    1     11
    5  Allen    3      9
    6   Mary    3      9
    7    Joe    1      1
    Option 2:
    >>> new_list = []
    >>> # Unpack each (name, group) pair instead of shadowing the built-in tuple;
    >>> # list() is needed because zip() returns an iterator in Python 3.
    >>> for name, group in df.groupby('Name'):
    ...     new_list.append({name: list(zip(group['Day'], group['Score']))})
    ...
    >>> new_list
    [{'Allen': [(1, 25), (3, 9)]}, {'Joe': [(3, 17), (2, 14), (1, 1)]}, {'Mary': [(2, 14), (1, 11), (3, 9)]}]
  • 28. IT’S TRUE. YOU CAN ANSWER THESE QUESTIONS WITH THE TIDY DATA FRAMES I JUST SHOWED YOU. You can always ‘reconstruct’ these trajectories of what happened by making a data frame per user
    >>> df
        Name  Day  Score
    0  Allen    1     25
    1    Joe    3     17
    2    Joe    2     14
    3   Mary    2     14
    4   Mary    1     11
    5  Allen    3      9
    6   Mary    3      9
    7    Joe    1      1
    Option 3:
    >>> def process(new_df):
    ...     # One slot per day (1 to 3); None marks a day with no score.
    ...     return [new_df[new_df['Day'] == i]['Score'].values[0]
    ...             if i in list(new_df['Day']) else None
    ...             for i in range(1, 4)]
    ...
    >>> df.groupby(['Name']).apply(process)
    Name
    Allen    [25, None, 9]
    Joe        [1, 14, 17]
    Mary       [11, 14, 9]
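For comparison, the reshaping that Option 3 performs can also be sketched without pandas. This is an illustrative plain-Python version, not the speaker's code: the `rows` list and the `trajectories` helper are hypothetical names, and the fixed `days=(1, 2, 3)` range is an assumption taken from the slide's data.

```python
from collections import defaultdict

# The tidy rows from the slides: (Name, Day, Score).
rows = [
    ("Allen", 1, 25), ("Joe", 3, 17), ("Joe", 2, 14), ("Mary", 2, 14),
    ("Mary", 1, 11), ("Allen", 3, 9), ("Mary", 3, 9), ("Joe", 1, 1),
]


def trajectories(rows, days=(1, 2, 3)):
    """Group scores by name and lay them out by day, None for missing days."""
    by_name = defaultdict(dict)
    for name, day, score in rows:
        by_name[name][day] = score
    return {
        name: [by_day.get(d) for d in days]
        for name, by_day in sorted(by_name.items())
    }


result = trajectories(rows)
```

Either way, the per-user trajectory has to be rebuilt from the tidy rows before any of the interesting questions can be asked, which is the point the next slides make.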
  • 29. LET’S BE HONEST…NO ONE WANTS THAT TO BE A FIRST STEP TO EVERY QUERY...AND INCREASINGLY WE’LL BE REQUIRED TO MAKE THESE SORTS OF QUERIES • Google ads (well… maybe less so in Europe) 29
  • 30. LET’S BE HONEST…NO ONE WANTS THAT TO BE A FIRST STEP TO EVERY QUERY...AND INCREASINGLY WE’LL BE REQUIRED TO MAKE THESE SORTS OF QUERIES • Google ads (well… maybe less so in Europe) • Wearable sensors 30
  • 31. LET’S BE HONEST…NO ONE WANTS THAT TO BE A FIRST STEP TO EVERY QUERY...AND INCREASINGLY WE’LL BE REQUIRED TO MAKE THESE SORTS OF QUERIES • Google ads (well… maybe less so in Europe) • Wearable sensors • The unit of an observation should be the actor, not the particular action observed at a particular time. 31
  • 32. LET’S BE HONEST…NO ONE WANTS THAT TO BE A FIRST STEP TO EVERY QUERY...AND INCREASINGLY WE’LL BE REQUIRED TO MAKE THESE SORTS OF QUERIES • Google ads (well… maybe less so in Europe) • Wearable sensors • The unit of an observation should be the actor, not the particular action observed at a particular time. • Maybe we should rethink what we mean by ‘observations’ 32
  • 33. We don’t look for No-SQL because we have No-SQL databases... We have No-SQL databases because we have No-SQL data. • High scalability • Distributed computing • Schema flexibility • Semi-structured data • No complex relationships • Schemas change all the time • Patterns change all the time • Same units of interest repeating new things 33
  • 34. WHAT IS NO-SQL PYTHON? Data that doesn’t seem like it fits in a data frame • Arbitrarily nested data • Ragged data • Comparative time series 34
  • 35. WHERE DO WE FIND NO-SQL DATA? Here’s where I’ve found it… • Physics lab • Running data • Health data • Reddit 35
  • 37. GETTING THE DATA INTO PYTHON WHEN IT’S STRAIGHTFORWARD • Scenario: you’re grabbing a bunch of NoSQL data from an API or from a NoSQL db. • We’ll stick with JSON since it’s a common format. • Best case scenario: you’ll take everything however you can get it. In this case stick with pandas. The json_normalize function works great. 37
  • 38. JSON_NORMALIZE WORKS PRETTY WELL {"samples": [ { "name": "Jane Doe", "age": 42, "profession": "architect", "series": [ { "day": 0, "measurement_value": 0.97 }, { "day": 1, "measurement_value": 1.55 }, { "day": 2, "measurement_value": 0.67 } ]}, { "name": "Bob Smith", "hobbies": ["tennis", "cooking"], "age": 37, "series": { "day": 0, "measurement_value": 1.25 } }]}
  • 39. JSON_NORMALIZE WORKS PRETTY WELL {"samples": [ { "name": "Jane Doe", "age": 42, "profession": "architect", "series": [ { "day": 0, "measurement_value": 0.97 }, { "day": 1, "measurement_value": 1.55 }, { "day": 2, "measurement_value": 0.67 } ]}, { "name": "Bob Smith", "hobbies": ["tennis", "cooking"], "age": 37, "series": { "day": 0, "measurement_value": 1.25 } }]}
  • 40. JSON_NORMALIZE WORKS PRETTY WELL {"samples": [ { "name": "Jane Doe", "age": 42, "profession": "architect", "series": [ { "day": 0, "measurement_value": 0.97 }, { "day": 1, "measurement_value": 1.55 }, { "day": 2, "measurement_value": 0.67 } ]}, { "name": "Bob Smith", "hobbies": ["tennis", "cooking"], "age": 37, "series": { "day": 0, "measurement_value": 1.25 } }]}
  • 41. JSON_NORMALIZE WORKS PRETTY WELL {"samples": [ { "name": "Jane Doe", "age": 42, "profession": "architect", "series": [ { "day": 0, "measurement_value": 0.97 }, { "day": 1, "measurement_value": 1.55 }, { "day": 2, "measurement_value": 0.67 } ]}, { "name": "Bob Smith", "hobbies": ["tennis", "cooking"], "age": 37, "series": { "day": 0, "measurement_value": 1.25 } }]}
    Easy to process
    >>> import json
    >>> from pandas import json_normalize
    >>> with open(json_file) as data_file:
    ...     data = json.load(data_file)
    ...
    >>> normalized_data = json_normalize(data['samples'])
  • 42. JSON_NORMALIZE WORKS PRETTY WELL {"samples": [ { "name": "Jane Doe", "age": 42, "profession": "architect", "series": [ { "day": 0, "measurement_value": 0.97 }, { "day": 1, "measurement_value": 1.55 }, { "day": 2, "measurement_value": 0.67 } ]}, { "name": "Bob Smith", "hobbies": ["tennis", "cooking"], "age": 37, "series": { "day": 0, "measurement_value": 1.25 } }]}
    Easy to process
    >>> import json
    >>> from pandas import json_normalize
    >>> with open(json_file) as data_file:
    ...     data = json.load(data_file)
    ...
    >>> normalized_data = json_normalize(data['samples'])
    Basically, it just works
    >>> # Second entry of the first sample's series.
    >>> print(normalized_data['series'][0][1])
    {'day': 1, 'measurement_value': 1.55}
  • 43. NORMALIZE_JSON WORKS PRETTY WELL (same JSON as on slide 38)
        Easy to process:
        >> import json
        >> from pandas import json_normalize  # pandas 1.0+; older releases: from pandas.io.json import json_normalize
        >> with open(json_file) as data_file:
        >>     data = json.load(data_file)
        >> normalized_data = json_normalize(data['samples'])
        Easy to add columns:
        >> normalized_data['length'] = normalized_data['series'].apply(len)
        Basically, it just works:
        >> print(normalized_data['series'][0][1])
        {u'measurement_value': 1.55, u'day': 1}
  • 47. USING SOME PROGRAMMER STUFF ALSO HELPS
        ITEM_TO_GET = 'measurement_value'  # example value: the key to pull out of each element
        class dfList(list):
            def __init__(self, originalValue):
                # Rebinding `self` has no effect, so initialise the underlying list instead
                if isinstance(originalValue, list):
                    list.__init__(self, originalValue)
                else:
                    list.__init__(self, list(originalValue))
            def __getitem__(self, item):
                result = list.__getitem__(self, item)
                try:
                    return result[ITEM_TO_GET]
                except (KeyError, IndexError, TypeError):
                    return result
            def __iter__(self):
                for i in range(len(self)):
                    yield self.__getitem__(i)
            def __call__(self):
                return sum(self) / float(len(self))
      • Subclass an iterable to shorten your apply() calls
      • In particular, you need to override at least __getitem__ and __iter__
      • You should probably override __init__ as well to handle inconsistent formats
      • Then __call__ can be a catch-all adjustable function... best to load it up with a call to a class function, which you can adjust at will anytime.
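A minimal sketch of how such a list subclass might be used, written for Python 3 and without pandas so it runs on its own. The name `DfList`, the measurement values, and the choice to wrap a lone dict in a one-element list are assumptions made for illustration; the slide does not say how a non-list value should be handled.

```python
ITEM_TO_GET = "measurement_value"  # assumed key of interest


class DfList(list):
    """List of records that hands back one field per element."""

    def __init__(self, original_value):
        # Assumption: a lone dict means "a series with one element".
        if isinstance(original_value, dict):
            original_value = [original_value]
        super().__init__(original_value)

    def __getitem__(self, item):
        result = super().__getitem__(item)
        try:
            return result[ITEM_TO_GET]
        except (KeyError, IndexError, TypeError):
            return result  # no such field: return the element unchanged

    def __iter__(self):
        for i in range(len(self)):
            yield self[i]

    def __call__(self):
        # Catch-all summary: here, the mean of the selected field.
        return sum(self) / len(self)


# Made-up values, shaped like the "series" entries in the sample JSON.
jane = DfList([{"day": 0, "measurement_value": 1.0},
               {"day": 1, "measurement_value": 2.0},
               {"day": 2, "measurement_value": 3.0}])
bob = DfList({"day": 0, "measurement_value": 1.25})

jane_mean = jane()
bob_mean = bob()
```

With a DataFrame, the same idea would be a column-wide `apply(DfList)` followed by `apply(lambda s: s())`, which is the shortening the slide is after.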
  • 48. CUSTOM CLASSES PAIR NICELY WITH CLASS METHODS
        class Test:
            def __init__(self, name):
                self.name1 = name
            def print_class_instance(instance):
                print(instance.name1)
            def print_self(self):
                # Looked up on the class at call time, so it can be swapped later
                self.__class__.print_class_instance(self)
        >>> test1 = Test('test1')
        >>> test1.print_self()
        test1
        >>> def new_printing(instance):
        ...     print("Now I'm printing a constant string")
        ...
        >>> test1.print_self()
        test1
        >>> Test.print_class_instance = new_printing
        >>> test1.print_self()
        Now I'm printing a constant string
      • Design flexible classes that often reference class methods rather than instance methods
      • Then as you are processing data, you can quickly swap out methods to call different field names in the event of highly nested JSON
      • Data processing goes faster, with no mental gymnastics or annoying parsing effort required
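The swap described in the bullets above could look roughly like this when the thing being swapped is a field accessor. This is a sketch under assumptions: the class `Sample`, the function `extract_nested`, and the `reading`/`value`/`unit` keys are hypothetical names chosen for the example.

```python
class Sample:
    def __init__(self, record):
        self.record = record

    @staticmethod
    def extract(record):
        # Default accessor for the flat layout.
        return record["measurement_value"]

    def value(self):
        # Look the accessor up on the class at call time, so a swap takes effect.
        return type(self).extract(self.record)


def extract_nested(record):
    # Accessor for a (hypothetical) more deeply nested layout.
    return record["reading"]["value"]


flat = Sample({"day": 0, "measurement_value": 0.97})
before = flat.value()

# The data changed shape mid-stream: swap the accessor on the class.
Sample.extract = staticmethod(extract_nested)
nested = Sample({"day": 1, "reading": {"value": 1.55, "unit": "mV"}})
after = nested.value()
```

Every existing and future instance picks up the new accessor, which is the point of referencing the class rather than the instance.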
  • 49. GETTING NOSQL DATA: COMMONLY-ENCOUNTERED PROBLEMS • CSVs with arrays • Highly-nested JSON • Unknown or unreliably formatted API results
  • 53. SOMETIMES YOU GET WEIRD CSV FILES… • Sometimes your problem is as simple as getting a CSV file with nested data • This is pretty straightforward to deal with… use regex and common Python string operations to clean up the data • apply() is your best friend • Common problem: spaces between “,” and a column name or column value. Pass skipinitialspace=True to read_csv() to avoid it: df = pd.read_csv("in.csv", sep=",", skipinitialspace=True)
  • 57. SOMETIMES YOU GET WEIRD CSV FILES…
        name,favorites,age
        joe,"[madonna,elvis,u2]",28
        mary,"[lady gaga, adele]",36
        allen,"[beatles, u2, adele, rolling stones]"
        This isn’t even that weird
        >> df = pd.read_csv(file_name, sep=",")
        Downright straightforward
        >> print(df['favorites'][0][1])
        m
        Hmmm… each array is still a single string. Regex and string methods to the rescue… Python’s exceptionally easy string parsing is a huge asset for No-SQL parsing
        >> df['favorites'] = df['favorites'].apply(lambda s: [x.strip() for x in s[1:-1].split(',')])
        >> print(df['favorites'][0][1])
        elvis
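The same clean-up can be sketched with only the standard library, which makes the string handling easy to see in isolation. This is an illustration rather than the talk's code: it swaps `csv.DictReader` in for `pd.read_csv`, and `parse_array` is a hypothetical helper name.

```python
import csv
import io

RAW = '''name,favorites,age
joe,"[madonna,elvis,u2]",28
mary,"[lady gaga, adele]",36
allen,"[beatles, u2, adele, rolling stones]"
'''


def parse_array(cell):
    """Turn the text '[a, b, c]' into the list ['a', 'b', 'c']."""
    inner = cell.strip()[1:-1]  # drop the surrounding brackets
    return [item.strip() for item in inner.split(",") if item.strip()]


rows = []
for row in csv.DictReader(io.StringIO(RAW)):
    row["favorites"] = parse_array(row["favorites"])
    rows.append(row)
```

The quoted cells arrive as one string each, so splitting on commas after stripping the brackets is all that is needed; the missing age for allen simply comes back as `None`.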
  • 58. WHAT ABOUT THIS ONE?
        name,favorites,age
        joe,[madonna,elvis,u2],28
        mary,[lady gaga, adele],36
        allen,[beatles, u2, adele, rolling stones]
        This isn’t even that weird
        >> df = pd.read_csv(file_name, sep=",")
        Downright straightforward? Actually this fails miserably:
        >> print(df['favorites'])
        joe [madonna elvis u2]
        mary [lady gaga adele] 36
        Name: name, dtype: object
        We need more regex… this time before applying read_csv()…
  • 61.
        name,favorites,age
        joe,[madonna,elvis,u2],28
        mary,[lady gaga, adele],36
        allen,[beatles, u2, adele, rolling stones]
        Missing quotes around arrays: basically, put in the quotation marks to help out read_csv()
        import re
        import pandas as pd
        pattern = r"(\[.*\])"  # the brackets must be escaped to match literally
        with open(file_name) as f, open(write_file, 'w') as write_f:
            for line in f:
                new_line = line
                for m in re.finditer(pattern, line):
                    # wrap each bracketed array in double quotes
                    replacement = '"' + m.group(1) + '"'
                    new_line = new_line.replace(m.group(1), replacement)
                write_f.write(new_line)
        new_df = pd.read_csv(write_file)
        With multiple arrays per row, you’re gonna need to accommodate the greedy nature of regex:
        pattern = r"(\[.*?\])"
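To make the greedy versus non-greedy difference concrete, here is a small standard-library sketch. The `pets` column and the helper name `quote_arrays` are made up for the example, and `re.sub` with `csv.reader` stand in for the file rewriting and `read_csv()` steps.

```python
import csv
import io
import re

GREEDY = re.compile(r"\[.*\]")       # runs from the first '[' to the last ']'
NON_GREEDY = re.compile(r"\[.*?\]")  # stops at the nearest ']'


def quote_arrays(line, pattern=NON_GREEDY):
    """Wrap every bracketed array on the line in double quotes."""
    return pattern.sub(lambda m: '"' + m.group(0) + '"', line)


raw = "name,favorites,pets\njoe,[madonna,elvis,u2],[cat, dog]\n"

fixed = "".join(quote_arrays(l) for l in raw.splitlines(keepends=True))
rows = list(csv.reader(io.StringIO(fixed)))

# The greedy pattern merges both arrays into a single quoted field.
merged = "".join(quote_arrays(l, GREEDY) for l in raw.splitlines(keepends=True))
merged_rows = list(csv.reader(io.StringIO(merged)))
```

Using `sub` on the match itself, instead of `str.replace` on the matched text, also avoids quoting the same array twice when it appears more than once on a line.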
  • 62. THAT WAS A LOT OF TEXT… ALMOST DONE
  • 63. SOMETIMES YOU GET JSON AND YOU KNOW THE STRUCTURE, YOU JUST DON’T LIKE IT
      • Use json_normalize() and then shed columns you don’t want. You’ve seen that today already (slides 32-38).
      • Use some magic: the sh module with jq to simplify your life… you can pick out the fields you want with jq either on the command line or through sh
      • jq has a straightforward, easy-to-learn syntax: . = value, [] = array operation, etc…
        import json
        import sh
        cat = sh.cat
        jq = sh.jq
        rule = """[{name: .samples[].name, days: .samples[].series[].day}]"""
        out = jq(rule, cat(_in=json_data)).stdout
        uni_out = out.decode('utf-8')  # assumed: the slide uses uni_out without defining it
        json.loads(uni_out)
  • 64. AND SOMETIMES YOU HAVE NO IDEA WHAT’S IN AN ENORMOUS JSON FILE • Inconsistent or undocumented API • Legacy Mongo database • Someone handed you some gnarly JSON because they couldn’t parse it
  • 65. YOU’RE A PROGRAMMER…USE ITERATORS • The ijson module is an iterative JSON parser… you can deal with the structure one bit at a time • This also gives you a great opportunity to make data parsing decisions as you go • This isn’t fast, but it’s also not fast to shoot from the hip when you’re talking about gnarly JSON
  • 66. YOU’RE A PROGRAMMER…USE ITERATORS
        import ijson
        with open(file_name, 'rb') as file:
            # stream one element of the top-level "samples" array at a time
            results = ijson.items(file, "samples.item")
            for record in results:
                for k in record.keys():
                    if isinstance(record[k], (dict, list)):
                        recursive_check(record[k])
                process(record)
  • 67. YOU’RE A PROGRAMMER…USE ITERATORS
        from collections import defaultdict
        from importlib import import_module
        total_dict = defaultdict(lambda: False)  # sorted key tuple -> class
        def recursive_check(d):
            if isinstance(d, dict):
                key = tuple(sorted(d.keys()))
                if not total_dict[key]:
                    # raw_input() on Python 2
                    answer = input("Input the new classname with a space and then the file name defining the class ")
                    class_name, module_name = answer.split()
                    mod = import_module(module_name)
                    total_dict[key] = getattr(mod, class_name)
                for k in d.keys():
                    new_class = recursive_check(d[k])  # recurse on the value, not the key
                    if new_class:
                        d[k] = new_class(**d[k])
                return total_dict[key]
            elif isinstance(d, list):
                for i in range(len(d)):
                    new_class = recursive_check(d[i])
                    if new_class:
                        d[i] = new_class(**d[i])
                return False
            else:
                return False
      • Basically, you can build custom classes or generate appropriate named tuples as you go.
      • This lets you know what you have and lets you build data structures to accommodate what you have.
      • Storing these objects in a class rather than a simple dictionary again gives you the option to customize .__call__() to your needs
  • 68. YOU’RE A PROGRAMMER…USE ITERATORS (same code as slide 67) • Basically, you can build custom classes or generate appropriate named tuples as you go. This lets you know what you have and lets you build data structures to accommodate what you have. • Again, remember that class methods can easily be adjusted dynamically, so it’s good to code classes with instances that reference class methods.
  • 70. CLUSTERING TIME SERIES • Reports of clustering and classifying time series are surprisingly rare • Methods are computationally demanding, O(N²)… but we’re getting there • Relatedly, ‘classification’ can also be used for series-related predictions • Can use many commonly applied clustering algorithms once you get the distance metric http://www.cse.ust.hk/vldb2002/VLDB2002-proceedings/slides/S12P01slides.pdf
  • 72. WHEN DO PEOPLE GO RUNNING? Actually, I made these plots with R…
  • 73. NANO-SCALE PHYSICS (Meisner et al., J. Am. Chem. Soc. 2012, 134, 20440−20445) • You can build an electrical circuit which has a single molecule as its narrowest part • It turns out it’s quite easy to distinguish different molecules by their trajectory as you pull on them • In particular, their summed behavior looks quite different • This suggests that we could cluster and identify individual measurements with reasonable certainty
  • 76. REDDIT • Several months of pulling the top 25 threads off Reddit’s front page shows significantly different trends for different subreddits. • Some kinds of posts don’t last long (r/TwoX and r/videos) • r/personalfinance shows a remarkable ability to have a second peak/second life on the front page • r/videos posts do great but burn out quickly
  • 77. QUICK: HOW IT WORKS I.
  • 78. QUICK: HOW IT WORKS II. • O(N²) in theory • Various lower bounding techniques significantly reduce processing time • Dynamic programming problem http://wearables.cc.gatech.edu/paper_of_week/DTW_myths.pdf
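The linked reference is about dynamic time warping (DTW); assuming that is the distance these slides have in mind, a minimal pure-Python sketch of the O(N²) dynamic program looks like this. No lower-bounding speed-up is included, and the two short series are made-up illustrations: the same bump, shifted by one step.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance via the full O(len(a) * len(b)) table."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            # extend the cheapest of: step in a, step in b, step in both
            cost[i][j] = d + min(cost[i - 1][j],
                                 cost[i][j - 1],
                                 cost[i - 1][j - 1])
    return cost[n][m] ** 0.5


def euclidean_distance(a, b):
    """Point-by-point distance between two equal-length series."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


ts1 = [0, 0, 1, 2, 1, 0, 0]
ts2 = [0, 1, 2, 1, 0, 0, 0]  # same shape as ts1, one step earlier

dtw = dtw_distance(ts1, ts2)
euclid = euclidean_distance(ts1, ts2)
```

Because DTW may stretch the time axis, the shifted copy aligns perfectly, while the point-by-point distance penalizes every misaligned sample.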
  • 79. WHY THE FANCY METHOD? Euclidean distance matches ts3 to ts1, despite our intuition that ts1 and ts2 are more alike. http://www.cse.ust.hk/vldb2002/VLDB2002-proceedings/slides/S12P01slides.pdf http://nbviewer.jupyter.org/github/alexminnaar/time-series-classification-and-clustering/blob/master/Time%20Series%20Classification%20and%20Clustering.ipynb
  • 82. WHY? Time series classification and related metrics can be one more thing to know… or even several more things to know
        Name   Ordered Scores  Score Trajectory Type  Number of Tests  Predicted Score For Next Test
        Joe    [1, 14, 17]     good                   3                19
        Mary   [11, 14, 9]     meh                    3                11
        Allen  [25, NA, 9]     underachiever          2                35
        Column annotations on the slide: info from classification, info from prediction, info from easy apply() calls
  • 86. THE SHORT VERSION • Pandas is already well-adapted to the No-SQL world • Make your data format work for you • Comparative time series go hand-in-hand with the increasing availability of No-SQL data. Everything is a time series if you look hard enough. • Non-time series collections are also informative. This was just one example of what you can do.