Taewook Eom
Data Infrastructure Team
SK planet
2014-01-28
Taewook Eom
Data Programmer
Plaster(Planet Master)
of Big Data Infra
Pre-Assessor of Hiring Programmers
Mentor of 101 Star...
Santa Clara
: Technical

New York
with Cloudera

: Financial, Business

Europe

: Privacy, Government

Boston
: Medical

h...
Data
When hardware became commoditized,
software was valuable.
Now software is being commoditized,
data is valuable.
– Tim O’...
What is Big Data?
All data that is not a fit for a traditional RDBMS,
whether used for OLTP or Analytics purposes

Big Dat...
Solving 'Big Data' Challenge Involves More Than Just Managing Volumes of Data
- Gartner, 2011

http://blog.vitria.com/Port...
http://image-store.slidesharecdn.com/ae63030a-3d9b-11e3-9cff-22000a970267-original.jpg
Defining your Big Data Arsenal: NoSQL, Hadoop, and RDBMS
http://strataconf.com/stratany2013/public/schedule/detail/29968
Data Science

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

http://en.wikipedia.org/wiki/File:DataSci...
Big Data

http://mappingignorance.org/fx/media/2013/07/Figura-11.jpg

Open Mind!
Big Data

Gartner's 2013 Hype Cycle for Emerging Technologies (2013-08-19)
more than half of
technical sessions
are presented by
Chinese or Indian speakers

39 of 125 sessions are
sponsored sessions
Big Data: 4 Approaches
Hadoop-based

RDB-based

Search-based

NoSQL
Real-time Processing

Real-time Recommendations for Retail: Architecture, Algorithms, and Design
http://strataconf.com/str...
Real-time Stream Processing
Apache
Kafka

Gathering

Apache
Storm

Processing
Querying

Streaming
Search-based
NoSQL
SQL

...
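The gathering, processing, and querying stages named on this slide can be sketched in miniature. This is only an illustration under stated assumptions: a standard-library `deque` stands in for a Kafka topic, a plain function stands in for a Storm processing step, and the names `gather`, `process`, and `query_top` as well as the sample events are hypothetical. It does not use the Kafka or Storm APIs.

```python
from collections import Counter, deque


def gather(raw_events, buffer):
    """Gathering stage: append incoming events to a buffer (stand-in for a message queue)."""
    for event in raw_events:
        buffer.append(event)


def process(buffer, counts):
    """Processing stage: drain the buffer one event at a time and update running counts."""
    while buffer:
        event = buffer.popleft()
        counts[event["item"]] += 1


def query_top(counts, n):
    """Querying stage: read the current top-n items from the running counts."""
    return [item for item, _ in counts.most_common(n)]


# Hypothetical click events; in a real deployment these would arrive continuously.
buffer = deque()
counts = Counter()
gather([{"item": "shoes"}, {"item": "hat"}, {"item": "shoes"}], buffer)
process(buffer, counts)
top = query_top(counts, 1)
print(top)
```

Keeping the three stages as separate functions mirrors the separation on the slide: each stage could be swapped for a real system without changing the others.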
… not yet Graph Processing
Big Data Space
No one tool is the right fit for all Big Data problems
Do not be afraid to recommend the right solution
for...
Practical Performance Analysis and Tuning for Cloudera Impala
http://strataconf.com/stratany2013/public/schedule/detail/30...
Hadoop and the Relational Data Warehouse – When to Use Which?
http://strataconf.com/stratany2013/public/schedule/detail/30...
Defining your Big Data Arsenal: NoSQL, Hadoop, and RDBMS
http://strataconf.com/stratany2013/public/schedule/detail/29968
Ignite
Signal Detection Theory: Man vs Machine
Co-Founder @VividCortex
Kyle Redinger
http://www.youtube.com/watch?v=Fg6mN-...
Signal Detection Theory: Man vs Machine

Remove the obvious and look at what is important
Remember: Less is more.
Keynote
Towards Strata 2014
Director of market research at O’Reilly Media
Roger Magoulas
http://www.youtube.com/watch?v=Yt...
Towards Strata 2014
Towards Strata 2014
Towards Strata 2014
Towards Strata 2014
Science is fundamentally about data,
but data is not fundamentally about science
Beyond R and Ph.D.s: The Mythology of Dat...
People

A data scientist is a data analyst who lives in California.

– George Roumeliotis, (Intuit)
http://www.anlytcs.com/2014/01/data-science-venn-diagram-v20.html
Data
Data
Data
Data

Businessperson: Business person, Leader, Entrepreneur
Creative: Artist, Jack-of-All-Trades, Hacker
Re...
Scientists think they can code,
software engineers think they are scientists.
Team them up so they collaborate.

– Scott S...
How Nordstrom Utilizes Human Intelligence to Blend Brick-and-Mortar with Online Commerce
http://strataconf.com/stratany201...
Data scientists spend their lives as data janitors
instead of leveraging their skills

– Wes McKinney (DataPad)

Building ...
Keynote
Is Bigger Really Better?
Predictive Analytics
with Fine-grained Behavior Data
Professor at the NYU Stern School of...
Is Bigger Really Better?
Predictive Analytics with Fine-grained Behavior Data
Is Bigger Really Better?
Predictive Analytics with Fine-grained Behavior Data
Is Bigger Really Better?
Predictive Analytics with Fine-grained Behavior Data

Predictive does not mean actionable.

– Sco...
More data gives you more precision, not more prediction.
Using multiple datasets to reduce errors when measuring values.
I...
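The precision half of this claim can be illustrated with a small simulation. It is a sketch under assumptions that are not in the talk: each dataset is modeled as an independent, equally noisy Gaussian measurement of the same true value, and the function names and numbers are hypothetical. Averaging more sources narrows the spread of the estimate; it does not make the measured quantity any more useful for predicting something else.

```python
import random
import statistics


def combined_estimate(true_value, noise_sd, n_sources, rng):
    """Average n_sources independent noisy measurements of the same value."""
    measurements = [rng.gauss(true_value, noise_sd) for _ in range(n_sources)]
    return statistics.fmean(measurements)


def spread_of_estimates(n_sources, trials=2000, seed=0):
    """Standard deviation of the combined estimate across repeated trials."""
    rng = random.Random(seed)
    estimates = [combined_estimate(10.0, 1.0, n_sources, rng) for _ in range(trials)]
    return statistics.stdev(estimates)


spread_one = spread_of_estimates(1)       # one dataset: spread close to the noise level
spread_sixteen = spread_of_estimates(16)  # sixteen datasets: roughly a quarter of that
print(spread_one, spread_sixteen)
```

Under these assumptions the spread shrinks roughly with the square root of the number of sources, which is why sixteen sources give about a fourfold gain in precision.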
Is Bigger Really Better?
Predictive Analytics with Fine-grained Behavior Data
Is Bigger Really Better?
Predictive Analytics with Fine-grained Behavior Data
Keynote
Big Impact from Big Data
Head of Analytics at Facebook
Ken Rudin
http://www.youtube.com/watch?v=RJFwsZwTBgg
(11 mi...
Big Impact from Big Data
Hadoop is a hammer,
but you need other tools along with it.

Designing Your Data-Centric Organization
Josh Klahr (Pivotal)...
Big Impact from Big Data

The way you organize information
depends on the question
you intend to ask of it.

- Richard Sau...
HaDump

: Loading data into Hadoop
for no reason.

Data Science Without a Scientist
http://strataconf.com/stratany2013/pu...
Big Impact from Big Data

Technical people still don't understand the business needs of business people!
Business people d...
Ask the Right Questions
Organizations already have people who know their own data
better than mystical data scientists.
Le...
Non-linear Storytelling: Towards New Methods and Aesthetics for Data Narrative
http://strataconf.com/stratany2013/public/s...
Every Soldier is a Sensor: Countering Corruption in Afghanistan
http://strataconf.com/stratany2013/public/schedule/detail/...
Big Impact from Big Data
Big Impact from Big Data
Big Impact from Big Data
Value of Data
Usable < Useful < Actionable
with Impact

If you can't answer "so what?",
you only have facts, not insig...
The Future of Hadoop
: What Happened
& What's Possible?
Co-Founder of Hadoop
Doug Cutting
http://www.youtube.com/watch?v=_...
Hadoop's Impact on the Future of Data Management
Mike Olson (Cloudera)

http://www.youtube.com/watch?v=puHS2JNKgRM
http://...
Single
: S/W & H/W system
: security model
: management model
: metadata model
: audit model
: resource management model
...
Last generation of data management is not sufficient
More copies, representations, transformations increase risk
Index onc...
Data Intelligence
Rethink How You See Data

Sharmila Shahani-Mulligan (ClearStory Data)

http://www.youtube.com/watch?v=07...
The Data Availability Problem

?

Access

Question
Sampling

Analysis & Discovery
Modeling

Loading
Insight

Data Prep – ...
Running Non-MapReduce Big Data applications on Apache Hadoop
http://strataconf.com/stratany2013/public/schedule/detail/307...
Apache HBase for Architects
http://strataconf.com/stratany2013/public/schedule/detail/30619
What’s Next for Apache HBase: ...
Securing the Apache Hadoop Ecosystem
http://strataconf.com/stratany2013/public/schedule/detail/30302
An Introduction to the Berkeley Data Analytics Stack With Spark, Spark Streaming, Shark, Tachyon, and BlinkDB
http://strat...
Schema
Information does not exist until a schema is defined
and data is stored in a relational database

- anonymous

Buil...
Lessons Learned From A Decade’s Worth of Big Data At The U.S. National Security Agency (NSA)
http://strataconf.com/stratan...
Managing a Rapidly Evolving Analytics Pipeline
http://strataconf.com/stratany2013/public/schedule/detail/30635
Stringer/Tez

Shark

SQL on/in Hadoop/HBase Solutions

Perception is Key: Telescopes, Microscopes and Data
http://strataco...
All SQL on Hadoop Solutions are
Missing the Point of Hadoop
Every Solution makes you define a schema

- SQL(Structured Que...
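The contrast this slide draws, between defining a schema before loading and applying one at read time (schema-on-read), can be illustrated with a short sketch. The raw records, field names, and the `read_with_schema` helper are all hypothetical, and the example uses plain JSON lines in memory rather than any Hadoop API.

```python
import json

# Raw records are stored as-is; their fields and types are inconsistent.
RAW_LINES = [
    '{"user": "a", "clicks": 3}',
    '{"user": "b", "clicks": "5", "country": "KR"}',
    '{"user": "c"}',
]


def read_with_schema(lines, schema):
    """Apply a schema while reading: pick fields, coerce types, fill defaults."""
    for line in lines:
        record = json.loads(line)
        yield {
            name: cast(record[name]) if name in record else default
            for name, (cast, default) in schema.items()
        }


# Two different questions, two different schemas, one unchanged raw data set.
clicks_schema = {"user": (str, ""), "clicks": (int, 0)}
country_schema = {"user": (str, ""), "country": (str, "unknown")}

rows = list(read_with_schema(RAW_LINES, clicks_schema))
countries = [r["country"] for r in read_with_schema(RAW_LINES, country_schema)]
print(rows)
print(countries)
```

Because the schema is supplied by the reader, a new question only needs a new schema dictionary, not a reload of the stored data.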
Lessons Learned
Hadoop Adventures At Spotify
http://strataconf.com/stratany2013/public/schedule/detail/30570
Hadoop Adventures At Spotify
http://strataconf.com/stratany2013/public/schedule/detail/30570
Quick prototyping is the fastest way to internal advocacy. Ship It!
Cloud == Speed
We don’t always need a complicated solu...
Questions?
SELECT questions FROM audience;
References
Strata Conference + Hadoop World 2013 Keynotes & Interviews

http://www.youtube.com/playlist?list=PL055Epbe6d5Z...
Strata Conference NYC 2013
Published in: Technology
  1. 1. Taewook Eom Data Infrastructure Team SK planet 2014-01-28
  2. 2. Taewook Eom Data Programmer Plaster(Planet Master) of Big Data Infra Pre-Assessor of Hiring Programmers Mentor of 101 Startup Korea Twitter: @taewooke LinkedIn: http://kr.linkedin.com/in/taewookeom http://www.flickr.com/photos/oreillyconf/10616622085/
  3. 3. Santa Clara : Technical New York with Cloudera : Financial, Business Europe : Privacy, Government Boston : Medical http://strataconf.com/ by O’Reilly Web 2.0 : Open, Sharing, Participation Big Data : Making Data Work Change the World with Data.
  4. 4. Data When hardware became commoditized, software was valuable. Now software is being commoditized, data is valuable. – Tim O’Reilly, 2011 Data is like the blood of the enterprise. – Amr Awadallah, CTO at Cloudera, 2013
  5. 5. What is Big Data? All data that is not a fit for a traditional RDBMS, whether used for OLTP or Analytics purposes Big Data Architectural Patterns http://strataconf.com/stratany2013/public/schedule/detail/30397
  6. 6. Solving 'Big Data' Challenge Involves More Than Just Managing Volumes of Data - Gartner, 2011 http://blog.vitria.com/Portals/47881/images/3values-resized-600.png
  7. 7. http://image-store.slidesharecdn.com/ae63030a-3d9b-11e3-9cff-22000a970267-original.jpg
  8. 8. Defining your Big Data Arsenal: NoSQL, Hadoop, and RDBMS http://strataconf.com/stratany2013/public/schedule/detail/29968
  9. 9. Data Science http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram http://en.wikipedia.org/wiki/File:DataScienceDisciplines.png
  10. 10. Big Data http://mappingignorance.org/fx/media/2013/07/Figura-11.jpg Open Mind!
  11. 11. Big Data Gartner's 2013 Hype Cycle for Emerging Technologies (2013-08-19)
  12. 12. More than half of technical sessions are presented by Chinese or Indian speakers. 39 of 125 sessions are sponsored sessions.
  13. 13. Big Data: 4 Approaches Hadoop-based RDB-based Search-based NoSQL
  14. 14. Real-time Processing Real-time Recommendations for Retail: Architecture, Algorithms, and Design http://strataconf.com/stratany2013/public/schedule/detail/30217
  15. 15. Real-time Stream Processing Apache Kafka Gathering Apache Storm Processing Querying Streaming Search-based NoSQL SQL Stringer/Tez Shark
  16. 16. … not yet Graph Processing
  17. 17. Big Data Space No one tool is the right fit for all Big Data problems Do not be afraid to recommend the right solution for the problem over the popular solution To do this, you must be aware of the entire ecosystem Big Data Architectural Patterns http://strataconf.com/stratany2013/public/schedule/detail/30397
  18. 18. Practical Performance Analysis and Tuning for Cloudera Impala http://strataconf.com/stratany2013/public/schedule/detail/30551
  19. 19. Hadoop and the Relational Data Warehouse – When to Use Which? http://strataconf.com/stratany2013/public/schedule/detail/30964
  20. 20. Defining your Big Data Arsenal: NoSQL, Hadoop, and RDBMS http://strataconf.com/stratany2013/public/schedule/detail/29968
  21. 21. Ignite Signal Detection Theory: Man vs Machine Co-Founder @VividCortex Kyle Redinger http://www.youtube.com/watch?v=Fg6mN-jevds (5 minutes 6 seconds) http://www.slideshare.net/realkyleredinger/man-vs-machine-signal-detection-theory-and-big-data
  22. 22. Signal Detection Theory: Man vs Machine Remove the obvious and look at what is important Remember: Less is more.
  23. 23. Keynote Towards Strata 2014 Director of market research at O’Reilly Media Roger Magoulas http://www.youtube.com/watch?v=Ytd5VkEgQf8 (5 minutes 26 seconds) http://strataconf.com/stratany2013/public/schedule/detail/31935 http://www.oreilly.com/data/free/files/stratasurvey.pdf
  24. 24. Towards Strata 2014
  25. 25. Towards Strata 2014
  26. 26. Towards Strata 2014
  27. 27. Towards Strata 2014
  28. 28. Science is fundamentally about data, but data is not fundamentally about science Beyond R and Ph.D.s: The Mythology of Data Science Debunked Douglas Merrill (ZestFinance) http://www.youtube.com/watch?v=J2sgObXbIWY (8 minutes 9 seconds)
  29. 29. People A data scientist is a data analyst who lives in California. – George Roumeliotis, (Intuit)
  30. 30. http://www.anlytcs.com/2014/01/data-science-venn-diagram-v20.html
  31. 31. Data Data Data Data Businessperson: Business person, Leader, Entrepreneur Creative: Artist, Jack-of-All-Trades, Hacker Researcher: Scientist, Researcher, Statistician Engineer: Engineer, Developer http://datacommunitydc.org/blog/2012/08/data-scientists-survey-results-teaser/ http://cdn.oreillystatic.com/oreilly/radarreport/0636920029014/Analyzing_the_Analyzers.pdf
  32. 32. Scientists think they can code, software engineers think they are scientists. Team them up so they collaborate. – Scott Sorenson (Ancestry.com) Ancestry.com: Managing Big Data Reaching Back to the 11th Century with Hadoop
  33. 33. How Nordstrom Utilizes Human Intelligence to Blend Brick-and-Mortar with Online Commerce http://strataconf.com/stratany2013/public/schedule/detail/30707
  34. 34. Data scientists spend their lives as data janitors instead of leveraging their skills – Wes McKinney (DataPad) Building More Productive Data Science and Analytics Workflows
  35. 35. Keynote Is Bigger Really Better? Predictive Analytics with Fine-grained Behavior Data Professor at the NYU Stern School of Business Foster Provost http://www.youtube.com/watch?v=1jzMiAfLH2c (10 minutes 16 seconds) http://strataconf.com/stratany2013/public/schedule/detail/31685
  36. 36. Is Bigger Really Better? Predictive Analytics with Fine-grained Behavior Data
  37. 37. Is Bigger Really Better? Predictive Analytics with Fine-grained Behavior Data
  38. 38. Is Bigger Really Better? Predictive Analytics with Fine-grained Behavior Data Predictive does not mean actionable. – Scott Sorenson (Ancestry.com) Ancestry.com: Managing Big Data Reaching Back to the 11th Century with Hadoop
  39. 39. Is Bigger Really Better? Predictive Analytics with Fine-grained Behavior Data More data gives you more precision, not more prediction. Using multiple datasets to reduce errors when measuring values. - Ravi Iyer (Ranker.com) Using Graphs of Data to Understand your Customers, Users, and Employees
  40. 40. Is Bigger Really Better? Predictive Analytics with Fine-grained Behavior Data
  41. 41. Is Bigger Really Better? Predictive Analytics with Fine-grained Behavior Data
  42. 42. Keynote Big Impact from Big Data Head of Analytics at Facebook Ken Rudin http://www.youtube.com/watch?v=RJFwsZwTBgg (11 minutes 57 seconds) http://strataconf.com/stratany2013/public/schedule/detail/31903
  43. 43. Big Impact from Big Data
  44. 44. Hadoop is a hammer, but you need other tools along with it. Designing Your Data-Centric Organization Josh Klahr (Pivotal) http://www.youtube.com/watch?v=D86udfrVzrI (12 minutes)
  45. 45. Big Impact from Big Data The way you organize information depends on the question you intend to ask of it. - Richard Saul Wurman Building a Data Platform
  46. 46. HaDump : Loading data into Hadoop for no reason. Data Science Without a Scientist http://strataconf.com/stratany2013/public/schedule/detail/31801
  47. 47. Big Impact from Big Data Technical people still don't understand the business needs of business people! Business people don't know what a table is. - Anurag Tandon (MicroStrategy) Inject Big Data into your Corporate DNA: Enable Every Employee to Make Data Driven Decisions
  48. 48. Ask the Right Questions Organizations already have people who know their own data better than mystical data scientists. Learning Hadoop is easier than learning the company’s business. - Gartner, 2012 Defining your Big Data Arsenal: NoSQL, Hadoop, and RDBMS http://strataconf.com/stratany2013/public/schedule/detail/29968
  49. 49. Non-linear Storytelling: Towards New Methods and Aesthetics for Data Narrative http://strataconf.com/stratany2013/public/schedule/detail/30207
  50. 50. Every Soldier is a Sensor: Countering Corruption in Afghanistan http://strataconf.com/stratany2013/public/schedule/detail/30828
  51. 51. Big Impact from Big Data
  52. 52. Big Impact from Big Data
  53. 53. Big Impact from Big Data
  54. 54. Value of Data Usable < Useful < Actionable with Impact If you can't answer "so what?", you only have facts, not insight - Baron Schwartz (VividCortex Inc) Making Big Data Small Descriptive (Easy): What happened? Predictive (Medium): What will happen? Prescriptive (Hard): What should we do about it? Hadoop & Data Science for the Enterprise
  55. 55. The Future of Hadoop : What Happened & What's Possible? Co-Founder of Hadoop Doug Cutting http://www.youtube.com/watch?v=_WwuZI6AhN8 (14 minutes 41 seconds) http://strataconf.com/stratany2013/public/schedule/detail/31591 Big Data is the first industry that was created by open source. - Jack Norris (MapR Technologies) Separating Hadoop Myths from Reality Hadoop is the kernel of the OS for data.
  56. 56. Hadoop's Impact on the Future of Data Management Mike Olson (Cloudera) http://www.youtube.com/watch?v=puHS2JNKgRM http://strataconf.com/stratany2013/public/schedule/detail/31380
  57. 57. Single: S/W & H/W system, security model, management model, metadata model, audit model, resource management model Common: storage & schema http://www.slideshare.net/cloudera/enterprise-data-hub-the-next-big-thing-in-big-data
  58. 58. Last generation of data management is not sufficient More copies, representations, transformations increase risk Index once and reuse across workloads, lifecycle NoSQL: indexing and updates for interactive apps Hadoop: staging, persistence, and analytics Data Governance for Regulated Industries Using Hadoop http://strataconf.com/stratany2013/public/schedule/detail/30738
  59. 59. Data Intelligence Rethink How You See Data Sharmila Shahani-Mulligan (ClearStory Data) http://www.youtube.com/watch?v=07hGulTOZGk (9 minutes 6 seconds) http://strataconf.com/stratany2013/public/schedule/detail/31742
  60. 60. The Data Availability Problem ? Access Question Sampling Analysis & Discovery Modeling Loading Insight Data Prep – too slow! Information Supply Chain Introducing a New Way to Interact with Insight http://strataconf.com/stratany2013/public/schedule/detail/31743 Presentation
  61. 61. Running Non-MapReduce Big Data applications on Apache Hadoop http://strataconf.com/stratany2013/public/schedule/detail/30755
  62. 62. Apache HBase for Architects http://strataconf.com/stratany2013/public/schedule/detail/30619 What’s Next for Apache HBase: Multi-tenancy, Predictability, and Extensions. http://strataconf.com/stratany2013/public/schedule/detail/30857
  63. 63. Securing the Apache Hadoop Ecosystem http://strataconf.com/stratany2013/public/schedule/detail/30302
  64. 64. An Introduction to the Berkeley Data Analytics Stack With Spark, Spark Streaming, Shark, Tachyon, and BlinkDB http://strataconf.com/stratany2013/public/schedule/detail/30959
  65. 65. Schema Information does not exist until a schema is defined and data is stored in a relational database - anonymous Building a Data Platform http://strataconf.com/stratany2013/public/schedule/detail/31400
  66. 66. Lessons Learned From A Decade’s Worth of Big Data At The U.S. National Security Agency (NSA) http://strataconf.com/stratany2013/public/schedule/detail/30913
  67. 67. Managing a Rapidly Evolving Analytics Pipeline http://strataconf.com/stratany2013/public/schedule/detail/30635
  68. 68. Stringer/Tez Shark SQL on/in Hadoop/HBase Solutions Perception is Key: Telescopes, Microscopes and Data http://strataconf.com/strataeu2013/public/schedule/detail/32351
  69. 69. All SQL on Hadoop Solutions are Missing the Point of Hadoop Every Solution makes you define a schema - SQL(Structured Query Language) is expressed over an assumed schema Major reasons why Hadoop has taken off include: - Ability to load data without defining a schema - Process data using schema-on-read instead of first defining a schema Hadoop contains a lot of: - Raw, granular data sets with potentially inconsistent schemas - Data sets in JSON, key-value, and other self-describing (non-relational) models designed for schema-on-read processing SQL on Hadoop solutions that make you first define a schema are missing a major part of Hadoop’s usage patterns Flexible Schema and the End of ETL http://strataconf.com/stratany2013/public/schedule/detail/31868
  70. 70. Lessons Learned
  71. 71. Hadoop Adventures At Spotify http://strataconf.com/stratany2013/public/schedule/detail/30570
  72. 72. Hadoop Adventures At Spotify http://strataconf.com/stratany2013/public/schedule/detail/30570
  73. 73. Quick prototyping is the fastest way to internal advocacy. Ship It! Cloud == Speed We don’t always need a complicated solution. KISS Play to your differentiating strengths. Experience >> Data Bias towards impact. It Takes a Village EASE!! (Emulate, Analyze, Scale, Evaluate) How Nordstrom Utilizes Human Intelligence to Blend Brick-and-Mortar with Online Commerce http://strataconf.com/stratany2013/public/schedule/detail/30707 Prototyping is key to overcoming resistance to change Technical architecture is heavily influenced by people organization Developing a team of experienced Hadoop users can often be done using internal employees A culture of experimentation and innovation yields the best result Ancestry.com: Managing Big Data Reaching Back to the 11th Century with Hadoop http://strataconf.com/stratany2013/public/schedule/detail/30499
  74. 74. Questions? SELECT questions FROM audience;
  75. 75. References Strata Conference + Hadoop World 2013 Keynotes & Interviews http://www.youtube.com/playlist?list=PL055Epbe6d5ZtziVAooUC04i1hL_Z9Xvk Slides & Video http://strataconf.com/stratany2013/public/schedule/proceedings Tweets https://twitter.com/search?q=%23strataconf #strataconf