1) Big data has grown exponentially in recent decades, from megabytes to petabytes, requiring new techniques for processing and analyzing large, diverse datasets.
2) Distributed computing frameworks like MapReduce allow massive datasets to be processed in parallel across many servers, similarly to how the human brain solves problems.
3) Free and open source big data tools now exist, like Hadoop and NoSQL databases, allowing individuals to leverage large public datasets and gain insights from their own data using cloud computing resources.
Big data 2013-05-23
1. Big Data
Øyvin Halfan Thuv
CTO Whitefox AS
e: oyvin@whitefox.no
t: @oyvinht
2. Abstract
• Short wrap-up of Big Data history
• But, what’s new? Why are we here?
• What can we do now (from our couch)?
3. Who am I... to talk about this?
• Ardent interest, B.Sc. in IT
  Maths (I particularly recommend discrete maths for Big Data!)
  Computational Linguistics
  AI stuff
  Thesis on data mining Unix system logs for surveillance
• M.Sc. degree in Artificial Intelligence (AI)
  Thesis on artificial life:
  «Incrementally Evolving a Dynamic Neural Network for Tactile-olfactory Insect Navigation»
  Nature is packed with Big Data
• Intern at CERN
  Developing the search engine
  Indexing (and making sense of) > 6 million documents
4. Mini-history
• Before ~2000
  Save just the stuff that could prove useful. Query/filter/select data to present it.
• After ~2000
  Just store everything - it’s cheap and we can look into it later.
  OLAP automates «looking».
• Gartner 2012:
  «Big data are high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.»
  (phew!)
Neo: Do you always look at it encoded?
Cypher: Well, you have to (...) there's way too much information to decode the Matrix. You get used to it. I... I don't even see the code. All I see is blonde, brunette, redhead... Hey, you want a drink?
Too much data
5. What’s new, then?
• Data capacity has doubled every 3-4 years since the 1980s!
• We used to have a small amount of interesting data
• Now we have tons of boring stuff!!
• We must handle it so that we «don’t even see the code»
6. What’s new, then?
• We used algorithms such as Apriori and ID3 for log analysis (a single-machine sketch follows at the end of this slide).
  Fine for 40MB of data per day.
• In artificial life, there could easily be this amount of data
... per minute.
• Google processed ~24PB of data per day in 2009.
• Your 1.4kg brain can interpret this slide instantly.
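
As a reference point for the scale problem, here is a minimal, assumed sketch (not the original log-analysis code) of Apriori-style frequent-itemset mining on a single machine. The log events and support threshold are invented for illustration; the point is that this style of analysis is fine at 40MB per day and hopeless at petabyte scale.

from collections import Counter

def apriori(transactions, min_support=2):
    """Return every itemset that occurs in at least `min_support` transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Pass 1: count single items and keep the frequent ones.
    counts = Counter(item for t in transactions for item in t)
    frequent = {frozenset([i]) for i, c in counts.items() if c >= min_support}
    result = {s: counts[next(iter(s))] for s in frequent}
    k = 2
    while frequent:
        # Pass k: join frequent (k-1)-itemsets into k-item candidates, then count.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        counts = Counter(c for t in transactions for c in candidates if c <= t)
        frequent = {c for c, n in counts.items() if n >= min_support}
        result.update({c: counts[c] for c in frequent})
        k += 1
    return result

# Toy "syslog" transactions: events observed together in one session (made up).
sessions = [
    {"login", "sudo", "scp"},
    {"login", "sudo"},
    {"login", "scp"},
    {"login", "sudo", "scp"},
]
print(apriori(sessions, min_support=2))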
7. This is new
• Your brain cells each solve one little problem, tell 10 other cells about the result, and then those tell 10 others ... you get it (fast!)
• Google distributes its computing ... somewhat like your brain.
• They called it MapReduce (see the sketch below).
[Figure: a Map phase fanned out across Node 1 ... Node n, followed by a Reduce phase]
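
A minimal sketch of the idea on this slide, not Google's implementation: a word count in which each worker process plays the role of one node, maps over its own chunk of input, and a final reduce step merges the partial counts. All names and data here are made up for illustration.

from collections import Counter
from multiprocessing import Pool

def map_word_counts(chunk_of_lines):
    """Map step: each worker counts words in its own chunk, independently."""
    counts = Counter()
    for line in chunk_of_lines:
        counts.update(line.split())
    return counts

def reduce_word_counts(partial_counts):
    """Reduce step: merge the per-worker counts into a single result."""
    total = Counter()
    for c in partial_counts:
        total.update(c)
    return total

if __name__ == "__main__":
    lines = ["big data is big", "data is everywhere", "big big data"]
    chunks = [lines[0:1], lines[1:2], lines[2:3]]      # one chunk per "node"
    with Pool(processes=3) as pool:
        partials = pool.map(map_word_counts, chunks)   # map in parallel
    print(reduce_word_counts(partials))                # reduce on one node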
8. You have it at home
• Free MapReduce-a-likes (Hadoop) are cheap to run in the cloud.
• MySQL is probably not a good choice for Big Data analysis.
• There are free NoSQL databases (Cassandra, Berkeley DB, MongoDB, and more) available; a minimal sketch follows below.
• Lots of data is freely available to play with. Analyze it in the cloud.
• «The Matrix is everywhere. It is all around us. Even now, in this very room.»
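
As one concrete "from the couch" example, here is a minimal sketch of loading and aggregating a few documents with a free NoSQL database. It assumes a MongoDB instance running locally and the pymongo driver installed; the database, collection, and field names are invented for illustration.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
events = client["playground"]["events"]          # hypothetical collection

# Load a few documents; the store is schemaless, so dicts go in as-is.
events.insert_many([
    {"user": "alice", "action": "login"},
    {"user": "alice", "action": "search"},
    {"user": "bob",   "action": "login"},
])

# Aggregate: count actions per user (MongoDB runs this server-side).
pipeline = [{"$group": {"_id": "$user", "n": {"$sum": 1}}}]
for row in events.aggregate(pipeline):
    print(row["_id"], row["n"])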
9. That’s it
• Data is growing.
• More information, but harder to find among all the garbage.
• Free software exists. You can make sense of your data too.
• Unleash hidden knowledge and work smarter!