Hortonworks DataFlow
Enterprise Data Flow powered by Apache NiFi
Mats Johansson
Solutions Engineer - EMEA
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Disclaimer
This document may contain product features and technology directions that are under development, may be under development in the future, or may ultimately not be developed.
Project capabilities are based on information that is publicly available within the Apache Software Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from inception to release through Apache; however, technical feasibility, market demand, user feedback, and the overarching Apache Software Foundation community development process can all affect timing and final delivery.
This document’s description of these features and technology directions does not represent a contractual commitment, promise, or obligation from Hortonworks to deliver these features in any generally available product.
Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
Since this document contains an outline of general product development plans, customers should not rely upon it when making purchasing decisions.
IoAT Data  Grows  Faster  Than  We  Consume  It
Much of the new data exists in flight, between systems and devices, as part of the Internet of Anything (IoAT).
The Opportunity
Unlock transformational business value from full-fidelity data and analytics for all data.
NEW: Internet of Anything
• Sensors and machines
• Geolocation
• Server logs
• Clickstream
• Social media
• Files & emails
TRADITIONAL: Traditional Data Sources
• ERP, CRM, SCM
Internet  of  Anything  is  Driving  New  Requirements
Need trusted insights from data, from the very edge to the data lake, in real time and with full fidelity
– Data generated by sensors, machines, geolocation devices, logs, clickstreams, social feeds, etc.
Modern applications need access to both data in motion and data at rest
IoAT data flows are multi-directional and point-to-point
– Very different from existing ETL, data movement, and streaming technologies, which are generally one-directional
The perimeter is outside the data center and can be very jagged
– This “Jagged Edge” creates new opportunities for security, data protection, data governance, and provenance
Architectural  Limitations  Today
• Traditional data movement software was built for a world of standardized data and one-way flows
• Tools built for newer types of data tend to be custom, difficult to manage, and architecturally disjoint
• Businesses cannot easily collect, conduct, and curate secure multi-directional and point-to-point IoAT data flows
• IoAT data flows are not optimized: they use costly, limited bandwidth and cannot dynamically prioritize the most valuable data
• It is difficult to gain actionable insights from the combination of data in motion and data at rest
Introducing Hortonworks DataFlow
The IoAT Data Flow
[Diagram labels: Internet of Anything; Hortonworks DataFlow powered by Apache NiFi (Perishable Insights); Hortonworks Data Platform powered by Apache Hadoop (Historical Insights); Store Data and Metadata; Enrich Context]
Hortonworks DataFlow and the Hortonworks Data Platform deliver the industry’s most complete solution for management of Big Data.
Simplistic  View  of  IoAT &  Data  Flow
The  Data  Flow  Thing
Process and Analyze Data
Acquire  Data
Store  Data
Realistic View of IoAT and Data Flow
Global interactions with customers, business partners, and things, spanning different volume, velocity, bandwidth, and latency needs
Meeting  IoAT Edge  Requirements
GATHER, DELIVER, PRIORITIZE
Track from the edge through to the datacenter
Small Footprints: operate with very little power
Limited Bandwidth: can create high latency
Data Availability: exceeds transmission bandwidth
Data Must Be Secured: throughout its journey
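When available data exceeds transmission bandwidth, the edge has to decide what to send first. The following is a minimal Python sketch of that idea, not NiFi code: the `Reading` type, the `select_for_transmission` function, the byte budget, and the convention that a lower number means higher priority are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """A hypothetical unit of edge data waiting to be transmitted."""
    name: str
    size_bytes: int
    priority: int  # assumption for this sketch: lower number = more valuable


def select_for_transmission(pending, budget_bytes):
    """Pick the most valuable readings that fit in this cycle's byte budget.

    Returns (to_send, deferred); deferred readings wait at the edge
    for a later cycle.
    """
    to_send, deferred = [], []
    remaining = budget_bytes
    # Highest priority first; within a priority, prefer smaller items.
    for reading in sorted(pending, key=lambda r: (r.priority, r.size_bytes)):
        if reading.size_bytes <= remaining:
            to_send.append(reading)
            remaining -= reading.size_bytes
        else:
            deferred.append(reading)
    return to_send, deferred


pending = [
    Reading("debug-log", 800, priority=3),
    Reading("alarm", 100, priority=1),
    Reading("telemetry", 500, priority=2),
]
to_send, deferred = select_for_transmission(pending, budget_bytes=700)
print([r.name for r in to_send])   # ['alarm', 'telemetry']
print([r.name for r in deferred])  # ['debug-log']
```

A real deployment would likely also age out or compress deferred data; this sketch only shows the ordering decision.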
Dataflow  requirements  within  the  Data  Center
Understanding
Ability to observe precisely how systems exchange data, in real time and historically
Agility
Ability to interact with and alter live flows and iterate on new ones
Dynamic Access Controls
The entitlements of users and systems and the sensitivity of data can change frequently
Cross-Cutting Concerns
Address common needs once, such as enrichment, filtering, and transformation
Enable architecture transition
Legacy vs. modern is an ‘always’ event. Format, schema, and protocol conversion is routine
Apache  NiFi:  Collect,  Conduct,  Curate
Collect: Bring Together
Aggregate all IoAT data from sensors, geolocation devices, machines, logs, files, and feeds via a highly secure lightweight agent
• Logs
• Files
• Feeds
• Sensors
Conduct: Mediate the Data Flow
Mediate point-to-point and bi-directional data flows, delivering data reliably to real-time applications and storage platforms such as HDP
• Deliver
• Secure
• Govern
• Audit
Curate: Gain Insights
Parse, filter, join, transform, fork, and clone data in motion to empower analytics and perishable insights
• Parse
• Filter
• Transform
• Fork
• Clone
A Brief History of Apache NiFi
2006: NiagaraFiles (NiFi) is first conceived by Joe Witt at the National Security Agency (NSA)
November 2014: NiFi is donated to the Apache Software Foundation (ASF) through NSA’s Technology Transfer Program and enters the ASF incubator
July 2015: NiFi reaches ASF top-level project status
Apache  NiFi:  Three  key  concepts
• Manage  the  flow  of  information
• Data  Provenance
• Secure  the  control  plane  and  data  plane
Apache  NiFi  – Key  Features
• Guaranteed delivery
• Data buffering
  - Backpressure
  - Pressure release
• Prioritized queuing
• Flow-specific QoS
  - Latency vs. throughput
  - Loss tolerance
• Data provenance
• Recovery/recording a rolling log of fine-grained history
• Visual command and control
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
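Two of the features listed above, backpressure and prioritized queuing, can be illustrated with a toy queue. This is a minimal Python sketch under stated assumptions, not NiFi's implementation: the class `BoundedPriorityQueue`, its method names, and the object-count threshold are hypothetical, and pressure release (aging off old data) is not modeled.

```python
import heapq
import itertools


class BoundedPriorityQueue:
    """Toy connection queue: prioritized, with a backpressure threshold."""

    def __init__(self, backpressure_threshold):
        self.backpressure_threshold = backpressure_threshold
        self._heap = []
        # Tie-breaker so items with equal priority leave in arrival order.
        self._counter = itertools.count()

    def is_full(self):
        return len(self._heap) >= self.backpressure_threshold

    def offer(self, item, priority):
        """Enqueue if below the threshold; return False to signal backpressure."""
        if self.is_full():
            return False
        heapq.heappush(self._heap, (priority, next(self._counter), item))
        return True

    def poll(self):
        """Dequeue the item with the lowest priority number, or None if empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = BoundedPriorityQueue(backpressure_threshold=2)
accepted = [queue.offer("routine", 5), queue.offer("urgent", 1), queue.offer("late", 3)]
print(accepted)                 # [True, True, False]
print(queue.poll())             # urgent
print(queue.offer("late", 3))   # True
print(queue.poll())             # late
print(queue.poll())             # routine
```

The rejected `offer` is the point of the sketch: the upstream producer learns that the queue is full and can slow down instead of overwhelming the consumer.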
Common  Apache  NiFi Use  Cases
Predictive  Analytics
Ensure  the  highest  value  data  is  captured  and  available  for  analysis
Compliance
Gain  full  transparency  into  provenance  and  flow  of  data  
IoT Optimization
Secure,  Prioritize,  Enrich  and  Trace  data  at  the  edge
Fraud  Detection
Move  sales  transaction  data  in  real  time  to  analyze  on  demand  
Big  Data  Ingest
Easily  and  efficiently  ingest  data  into  Hadoop
Value  Resources
Gain  visibility  into  how  data  sources  are  used  to  determine  value
Flow  Based  Programming  (FBP)
FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of an entirely new component simply by composition of its components.
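The mapping above can be made concrete with a toy model. The following Python sketch is illustrative only and is not NiFi's Java API: the class shapes, the `on_trigger` and `run` method names, and the capacity parameter are assumptions, and Process Groups are not modeled.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Information packet: content plus key/value attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


class Connection:
    """Bounded buffer between two processors."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._queue = deque()

    def has_room(self):
        return len(self._queue) < self.capacity

    def put(self, flowfile):
        if not self.has_room():
            raise OverflowError("connection is full")
        self._queue.append(flowfile)

    def take(self):
        return self._queue.popleft() if self._queue else None

    def __len__(self):
        return len(self._queue)


class Processor:
    """Black box: applies a function to each FlowFile taken from its input."""

    def __init__(self, name, transform, inbound, outbound):
        self.name = name
        self.transform = transform
        self.inbound = inbound
        self.outbound = outbound

    def on_trigger(self):
        """Process one FlowFile if there is input and room downstream."""
        if len(self.inbound) == 0 or not self.outbound.has_room():
            return False
        self.outbound.put(self.transform(self.inbound.take()))
        return True


class FlowController:
    """Scheduler: triggers every processor until none of them does any work."""

    def __init__(self, processors):
        self.processors = processors

    def run(self):
        # The list (rather than a generator) ensures every processor is
        # triggered on each pass.
        while any([p.on_trigger() for p in self.processors]):
            pass


def to_upper(ff):
    return FlowFile(ff.content.upper(), {**ff.attributes, "case": "upper"})


source, sink = Connection(capacity=10), Connection(capacity=10)
source.put(FlowFile(b"hello", {"filename": "a.txt"}))
FlowController([Processor("UpperCase", to_upper, source, sink)]).run()
result = sink.take()
print(result.content, result.attributes)
```

Because each processor only sees its inbound and outbound connections, processors can run at different rates and be rearranged without knowing about each other, which is the core of the flow-based style.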
Hortonworks DataFlow
Powered by Apache NiFi
Visual User Interface
HTML5, drag and drop, for agile execution
High Throughput, Low Bandwidth
for any data, big or small
Provenance Metadata
for governance and compliance
Secure End-to-End Data Routing
with encryption and compression
Basics  of  Connecting  Systems
For  every  connection,  
these  must  agree:
1. Protocol
2. Format
3. Schema
4. Priority
5. Size  of  event
6. Frequency  of  event
7. Authorization  access
8. Relevance
[Diagram: producer P1 connected directly to consumer C1]
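The eight points of agreement listed above can be treated as a contract that each side of a connection declares. The following Python sketch checks where a producer and a consumer disagree. The `Contract` type, its field names, and the sample values are hypothetical, and plain equality is a simplification (for size and frequency a real check would compare limits rather than exact values).

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Contract:
    """What one side of a connection expects; fields mirror the eight items."""
    protocol: str
    format: str
    schema: str
    priority: str
    max_event_bytes: int
    events_per_second: int
    authorization: str
    relevance: str


def mismatches(producer, consumer):
    """Return the names of the dimensions on which the two sides disagree."""
    return [
        f.name
        for f in fields(Contract)
        if getattr(producer, f.name) != getattr(consumer, f.name)
    ]


p1 = Contract("http", "json", "order-v2", "high", 4096, 100, "partner", "orders")
c1 = Contract("sftp", "csv", "order-v2", "high", 4096, 100, "partner", "orders")
print(mismatches(p1, c1))  # ['protocol', 'format']
```

Every name returned by `mismatches` is something that has to be mediated somewhere between the two systems.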
Using  Messaging
Only  a  subset  agree  
using  messaging
1. Protocol
2. Format
3. Schema
4. Priority
5. Size  of  event
6. Frequency  of  event
7. Authorization  access
8. Relevance
[Diagram: producer P1 → Messaging → consumers C1 … CN]
More  issues  to  consider:
• How  do  you  know  what  the  data  flow  looks  like?  
• How  is  it  managed?
• How  is  it  working  – today,  yesterday?
Using  an  Enterprise  Service  Bus  (ESB)
Still,  only  a  subset  agree  
using  an  ESB:
1. Protocol
2. Format
3. Schema
4. Priority
5. Size  of  event
6. Frequency  of  event
7. Authorization  access
8. Relevance
[Diagram: producer P1 → Messaging/Broker → consumers C1 … CN]
Even more issues to consider:
• Remote procedure calls (RPC) and throughput issues are introduced
• Design and deploy management – slow setup, not interactive
• You can scale out, but not up or down
• You still don’t know what the data flow looks like
Architecture
[Diagram: a NiFi cluster. Master: the NiFi Cluster Manager (NCM), a JVM on an OS/host running a web server and the request replicator. Slaves: NiFi nodes, each a JVM on an OS/host running a web server and a flow controller (Processor 1 … Extension N), backed by a FlowFile repository, a content repository, and a provenance repository on local storage.]
High Availability: Control plane vs. Data plane…
Define  A  Hortonworks  DataFlow
• Easy  to  use  drag  and  drop  UI
• Flexible  to  define  the  Data  Flow
HDF  – Powered  by  Apache  NiFi
Add  processor  for  data  intake
1. Drag and drop the processor icon from the top menu
Choose  the  specific  processor
2. Choose one of the processors – currently 90 available – designed for extension
Example:  Pick  Twitter  Processor
Configure  the  processor
3. Select the processor and choose the option to Configure
4. Adjust parameters as required
Another  processor  for  data  output
5. Drag and drop the processor icon from the top menu
6. Example: choose the PutHDFS processor
Configure  second  processor
7. Configure the second processor
8. Connect the processors and configure the connection
9. Click Start to begin processing
See  processors  update  with  real  time  changes
10. As data flows, the GUI updates in real time.
Dynamically  adjust  and  tune  data  flow  as  needed
11. Dynamically adjust and tune the dataflow as needed, in real time. You can also replicate data for testing and comparison.
Understand  the  data  path  with  Data  Provenance
14. Select Data Provenance
Trace  lineage  of  a  particular  piece  of  data
15. Icon for Data Lineage
Every  change  to  data  is  tracked:  processing,  views
16. Provenance event is tracked
Updates  as  changes  happen
17. Updates as data flows
Easily  access  and  trace  changes  to  dataflow
Audit  trail  of  Hortonworks  DataFlow User  Actions
NiFi is complementary to Hadoop
Operations
Deployment flexibility from devices to data center. Delivers data flow QoS across dimensions such as loss-tolerant vs. guaranteed delivery, low latency vs. high throughput, and priority-based queuing.
Governance
Starting at the source, captures fine-grained metadata about all data received, forked, joined, cloned, modified, sent, and ultimately dropped as data reaches its configured end state, delivering comprehensive governance (also known as provenance or chain of custody).
Security
Secures the data movement from beginning to end. Allows fine-grained data authorization policies to be enforced at the flow level.
Operations
• Reporting tasks (push)
• Statistics / status (pull)
• Dynamic flow changes
  - Push new business rules via REST API (closed loop)
  - Pull updates periodically from web services
• Site-to-site
  - Stay at the ‘flow level’ rather than suddenly dealing with file transfer protocols
• Extensible
• Optimized user experience – log hunts should be the exception
Scale down, up, and out – in containers and on virtual machines
The  Need  for  Data  Provenance
For  Operators
• Traceability,  lineage
• Recovery  and  replay
For  Compliance
• Audit  trail
For  Business
• Value  sources  
• Value  IT  investment
[Diagram: LINEAGE traced from BEGIN to END]
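Traceability and lineage come down to being able to walk a log of events back to where a piece of data came from. The following Python sketch shows one way to do that. It is a simplified illustration, not NiFi's provenance repository: the `ProvenanceEvent` fields, the event type strings, and the assumption that the log is in chronological order are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProvenanceEvent:
    """One hypothetical entry in a provenance log."""
    event_type: str                  # e.g. RECEIVE, CLONE, MODIFY, SEND
    flowfile_id: str
    parent_id: Optional[str] = None  # set when derived from another FlowFile


def lineage(events, flowfile_id):
    """Return the events that led to flowfile_id, in log order.

    Follows parent links backwards along a single ancestry chain.
    """
    parent_of = {e.flowfile_id: e.parent_id for e in events if e.parent_id}
    chain = {flowfile_id}
    current = flowfile_id
    # The second condition guards against cycles in a malformed log.
    while current in parent_of and parent_of[current] not in chain:
        current = parent_of[current]
        chain.add(current)
    return [e for e in events if e.flowfile_id in chain]


log = [
    ProvenanceEvent("RECEIVE", "ff-1"),
    ProvenanceEvent("RECEIVE", "ff-9"),
    ProvenanceEvent("CLONE", "ff-2", parent_id="ff-1"),
    ProvenanceEvent("MODIFY", "ff-2"),
    ProvenanceEvent("SEND", "ff-2"),
]
trace = [(e.event_type, e.flowfile_id) for e in lineage(log, "ff-2")]
print(trace)
```

The unrelated FlowFile `ff-9` is excluded from the trace, while the original `ff-1` is included because `ff-2` was cloned from it.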
Extending Data Governance from the Edge to Hadoop
Data Governance Requirements
Transparent: Governance standards and protocols must be clearly defined and available to all
Reproducible: Recreate the relevant data landscape at a given point in time
Auditable: Trace all relevant events and assets with appropriate historical lineage
Consistent: Compliance practices must be consistent
Must snap into existing data governance frameworks and openly exchange metadata
[Diagram labels: Holistic Data Governance; Internet of Anything; Hadoop Data Platform; Traditional Data Systems (ERP, CRM, SCM, ETL / DQ, MDM, ARCHIVE); Business Analytics; Visualization & Dashboards]
The  Need  for  Fine-­grained  Security  and  Compliance
It’s not enough to say you have encrypted communications
• Enterprise authorization services – entitlements change often
• People and systems with different roles require different access levels
• Tagged/classified data
Security
Administration
Central management and consistent security
• NiFi Cluster Manager
Authentication
Authenticate users and systems
• 2-way SSL support out of the box; additional types coming
Authorization
Provision access to data
• Pluggable authorization designed to fit any Identity and Access Management (IAM) scheme
• File-based authority provider out of the box
• Multi-role
Audit
Maintain a record of data access
• Detailed logging of all user actions
• Detailed logging of key system behaviors
• Data Provenance enables unparalleled tracking from the edge through the data lake
Data Protection
Protect data at rest and in motion
• Supports a variety of SSL/encrypted protocols
• Tag data and use the tags for fine-grained access controls
• Encrypt/decrypt content using pre-shared key mechanisms
Role | Description
Administrator | Configure system threads, user accounts, and flow audit history
Data Flow Manager | Manipulate the dataflow
Read Only | View the dataflow only
NiFi | Configure system threads, user accounts, and flow audit history
Proxy | Manipulate the dataflow
Provenance | Query the provenance repository and download content
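Multi-role authorization of the kind described above can be sketched as a mapping from roles to permitted actions. The following Python example is a hedged illustration, not NiFi's authority provider: the action names are invented, the assumption that a Data Flow Manager can also view the flow is mine, and the NiFi and Proxy roles are left out.

```python
from enum import Enum


class Role(Enum):
    ADMINISTRATOR = "Administrator"
    DATA_FLOW_MANAGER = "Data Flow Manager"
    READ_ONLY = "Read Only"
    PROVENANCE = "Provenance"


# Hypothetical action names, loosely following the role descriptions.
PERMISSIONS = {
    Role.ADMINISTRATOR: {"configure_threads", "manage_users", "view_flow_audit_history"},
    Role.DATA_FLOW_MANAGER: {"view_flow", "modify_flow"},  # viewing is an assumption
    Role.READ_ONLY: {"view_flow"},
    Role.PROVENANCE: {"query_provenance", "download_content"},
}


def is_allowed(user_roles, action):
    """A user may hold several roles; any one of them can grant the action."""
    return any(action in PERMISSIONS[role] for role in user_roles)


operator = {Role.READ_ONLY, Role.PROVENANCE}
print(is_allowed(operator, "query_provenance"))  # True
print(is_allowed(operator, "modify_flow"))       # False
```

Keeping the role-to-action table as data rather than code is what makes the scheme pluggable: an external identity system could supply the roles while the table stays unchanged.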
Operations:  Planned
Planned  Apache  NiFi Enhancements
IN PROGRESS: Enhanced configuration management of flows
STARTED: Extension and template registry
TARGETED TO NIFI 0.4.0 RELEASE: First-class Avro support
STARTED: Interactive queue management
STARTED: Multi-tenant data flow
FUTURE: Pluggable authentication
FUTURE: Reference-able process groups
FUTURE: Variable registry
https://cwiki.apache.org/confluence/display/NIFI/NiFi+Feature+Proposals
Try It Yourself
Download NiFi and the HDP Sandbox from hortonworks.com/sandbox
Tweet:  #hadooproadshow
Thank  you!
Mats  Johansson
mjohansson@hortonworks.com
@matsjo66
https://se.linkedin.com/in/matsjo66

Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

  • 1.
    Hortonworks  DataFlow Enterprise  Data Flow  powered  by  Apache  NiFi Mats  Johansson Solutions  Engineer  -­ EMEA ©  Hortonworks  Inc.  2011  – 2015.  All  Rights  Reserved
  • 2.
    Page  2 © Hortonworks  Inc.  2011  – 2014.  All  Rights  Reserved Disclaimer This  document  may  contain  product  features  and  technology  directions  that  are  under   development,  may  be  under  development  in  the  future  or  may  ultimately  not  be   developed. Project  capabilities  are  based  on  information  that  is  publicly  available  within  the  Apache   Software  Foundation  project  websites  ("Apache").    Progress  of  the  project  capabilities   can  be  tracked  from  inception  to  release  through  Apache,  however,  technical  feasibility,   market  demand,  user  feedback  and  the  overarching  Apache  Software  Foundation   community  development  process  can  all  effect  timing  and  final  delivery. This  document’s  description  of  these  features  and  technology  directions  does  not   represent  a  contractual  commitment,  promise  or  obligation  from  Hortonworks  to  deliver   these  features  in  any  generally  available  product. Product  features  and  technology  directions  are  subject  to  change,  and  must  not  be   included  in  contracts,  purchase  orders,  or  sales  agreements  of  any  kind. Since  this  document  contains  an  outline  of  general  product  development  plans,   customers  should  not  rely  upon  it  when  making  purchasing  decisions.
  • 3.
    Page  3 © Hortonworks  Inc.  2011  – 2014.  All  Rights  Reserved IoAT Data  Grows  Faster  Than  We  Consume  It Much  of  the  new  data   exists  in-­flight,  between   systems  and  devices  as   part  of  the  Internet  of   AnythingNEW TRADITIONAL The  Opportunity Unlock  transformational  business  value from  a  full  fidelity  of  data  and  analytics for  all  data. Geolocation Server  logs Files &  emails ERP,  CRM,  SCM Traditional  Data  Sources Internet  of  Anything Sensors and machines Clickstream Social  media
  • 4.
    Page  4 © Hortonworks  Inc.  2011  – 2014.  All  Rights  Reserved Internet  of  Anything  is  Driving  New  Requirements Need  trusted  insights  from  data  at  the  very  edge  to  the  data  lake  in  real-­ time  with  full-­fidelity – Data  generated  by  sensors,  machines,  geo-­location  devices,  logs,  clickstreams,  social  feeds,  etc.   Modern  applications need  access  to  both  data-­in-­motion  and  data-­at-­rest IoAT data  flows  are  multi-­directional  and  point-­to-­point – Very  different  than  existing  ETL,  data  movement,  and  streaming  technologies  which  are  generally  one  direction The  perimeter  is  outside  the  data  center  and  can  be  very  jagged – This  “Jagged  Edge”  creates  new  opportunity  for  security,  data  protection,  data  governance  and  provenance
  • 5.
    Page  5 © Hortonworks  Inc.  2011  – 2014.  All  Rights  Reserved Architectural  Limitations  Today • Traditional  data  movement  software  has  been  built  for  the  world  of   standardized data  and  one  way  flows • Tools  built  for  newer  types  of  data  tend  to  be  custom,  difficult  to   manage,  and  architecturally  disjoint • Businesses  can  not  easily  collect,  conduct,  and  curate  secure  multi-­ directional  and  point-­to-­point  IoAT data  flows • IoAT data  flows  are  not  optimized  and  use  costly/limited  bandwidth  and   cannot  dynamically  prioritize  the  most  valuable  data • Difficult  to  gain  actionable  insights  from  the  combination  of  data-­in-­ motion  and  data-­at-­rest
  • 6.
    Page   6©  Hortonworks  Inc.  2011  – 2015.  All  Rights  Reserved The  IoAT Data  Flow Hortonworks  Data  Platform powered  by  Apache  Hadoop Hortonworks  Data  Platform powered  by  Apache  Hadoop Enrich Context Store  Data   and  Metadata Internet of  Anything Hortonworks  DataFlow   powered  by  Apache  NiFi Perishable   Insights Historical Insights Introducing  Hortonworks  DataFlow Hortonworks  DataFlow  and  the  Hortonworks  Data  Platform   deliver  the  industry’s  most  complete  solution  for  management  of  Big  Data.
  • 7.
    Page   7©  Hortonworks  Inc.  2011  – 2015.  All  Rights  Reserved Simplistic  View  of  IoAT &  Data  Flow The  Data  Flow  Thing Process  and   Analyze  Data Acquire  Data Store  Data
  • 8.
    Page   8©  Hortonworks  Inc.  2011  – 2015.  All  Rights  Reserved Global  interactions  with  customers,  business  partners,  and  things spanning  different  volume,  velocity,  bandwidth,  and  latency  needs Realistic  View  of  IoAT and  Data  Flow
  • 9.
    Page   9©  Hortonworks  Inc.  2011  – 2015.  All  Rights  Reserved Meeting  IoAT Edge  Requirements GATHE R DELIVER PRIORITIZE Track  from  the  edge Through  to  the  datacenter Small  Footprints operate  with  very  little  power Limited  Bandwidth can  create  high  latency Data  Availability exceeds  transmission  bandwidth Data  Must  Be  Secured throughout  its  journey
  • 10.
    Page   10©  Hortonworks  Inc.  2011  – 2015.  All  Rights  Reserved Dataflow  requirements  within  the  Data  Center Understanding Ability  to  observe  precisely  how  systems  exchange  data  in  real-­time  and  historically Agility Ability  to  interact  with  and  alter  live  flows  and  iterate  on  new  ones Dynamic  Access  Controls The  entitlements  of  users  and  systems  and  sensitivity  of  data  can  change  frequently Cross  Cutting  Concerns Address  common  needs  once  like  enrichment,  filtering,  transformation Enable  architecture  transition Legacy  vs modern  is  an  ‘always’  event.    Format,  schema,  protocol  conversion  is  routine
  • 11.
    Page  11 © Hortonworks  Inc.  2011  – 2014.  All  Rights  Reserved Apache  NiFi:  Collect,  Conduct,  Curate Aggregate  all  IoAT data  from  sensors,  geo-­location  devices,   machines,  logs,  files,  and  feeds  via  a  highly  secure  lightweight  agent Collect:        Bring  Together• Logs • Files • Feeds • Sensors Mediate  point-­to-­point  and  bi-­directional  data  flows,  delivering  data   reliably  to  real-­time  applications  and  storage  platforms  such  as  HDP Conduct:    Mediate  the  Data  Flow• Deliver • Secure • Govern • Audit Parse,  filter,  join,  transform,  fork,  and  clone  data  in  motion  to   empower  analytics  and  perishable  insights Curate:        Gain  Insights• Parse • Filter • Transform • Fork • Clone
  • 12.
    Page  12 © Hortonworks  Inc.  2011  – 2014.  All  Rights  Reserved November  2014 NiFi is  donated  to  the  Apache  Software  Foundation   (ASF)  through  NSA’s  Technology  Transfer  Program   and  enters  ASF’s  incubator. 2006 NiagaraFiles (NiFi)  was  first  incepted  by  Joe  Witt  at   the  National  Security  Agency  (NSA) A  Brief  History  of  Apache  Nifi July  2015 NiFi reaches  ASF  top-­level  project  status
  • 13.
    Page   13©  Hortonworks  Inc.  2011  – 2015.  All  Rights  Reserved Apache  NiFi:  Three  key  concepts • Manage  the  flow  of  information • Data  Provenance • Secure  the  control  plane  and  data  plane
  • 14.
    Page   14©  Hortonworks  Inc.  2011  – 2015.  All  Rights  Reserved Apache  NiFi  – Key  Features • Guaranteed  delivery • Data  buffering   - Backpressure - Pressure  release • Prioritized  queuing • Flow  specific  QoS - Latency  vs.  throughput - Loss  tolerance • Data  provenance • Recovery/recording   a  rolling  log  of  fine-­ grained  history • Visual  command  and   control • Flow  templates • Pluggable/multi-­role   security • Designed  for  extension • Clustering
  • 15.
    Page   15©  Hortonworks  Inc.  2011  – 2015.  All  Rights  Reserved Common  Apache  NiFi Use  Cases Predictive  Analytics Ensure  the  highest  value  data  is  captured  and  available  for  analysis Compliance Gain  full  transparency  into  provenance  and  flow  of  data   IoT Optimization Secure,  Prioritize,  Enrich  and  Trace  data  at  the  edge Fraud  Detection Move  sales  transaction  data  in  real  time  to  analyze  on  demand   Big  Data  Ingest Easily  and  efficiently  ingest  data  into  Hadoop Value  Resources Gain  visibility  into  how  data  sources  are  used  to  determine  value
  • 16.
    Page   16©  Hortonworks  Inc.  2011  – 2015.  All  Rights  Reserved Flow  Based  Programming  (FBP) FBP  Term NiFi Term Description Information   Packet FlowFile Each object  moving  through  the  system. Black Box FlowFile   Processor Performs  the  work, doing  some  combination  of  data  routing,   transformation,  or  mediation  between  systems. Bounded   Buffer Connection The  linkage between  processors, acting  as  queues  and  allowing  various   processes  to  interact  at  differing  rates. Scheduler Flow   Controller Maintains  the  knowledge  of  how  processes  are  connected, and  manages   the  threads  and  allocations  thereof  which  all  processes  use. Subnet Process   Group A  set  of  processes  and  their  connections,  which  can  receive  and  send   data  via  ports.  A  process group  allows  creation  of  entirely  new   component  simply  by  composition  of  its components.
  • 17.
    Page   17©  Hortonworks  Inc.  2011  – 2015.  All  Rights  Reserved Hortonworks Data  Flow Visual  User  Interface HTML  5,  drag  and  drop,  for  agile  execution High  Throughput,  Low  Bandwidth for  any  data,  big  or  small Provenance  Metadata for  governance  and  compliance Secure  End-­to-­End  Data  Routing with  encryption  and  compressionPowered  by   Apache  NiFi
  • 18.
    Page   18©  Hortonworks  Inc.  2011  – 2015.  All  Rights  Reserved Basics  of  Connecting  Systems For  every  connection,   these  must  agree: 1. Protocol 2. Format 3. Schema 4. Priority 5. Size  of  event 6. Frequency  of  event 7. Authorization  access 8. Relevance P1 Producer C1 Consumer
  • 19.
    Page   19©  Hortonworks  Inc.  2011  – 2015.  All  Rights  Reserved Using  Messaging Only  a  subset  agree   using  messaging 1. Protocol 2. Format 3. Schema 4. Priority 5. Size  of  event 6. Frequency  of  event 7. Authorization  access 8. Relevance P1 CN C1 Messaging More  issues  to  consider: • How  do  you  know  what  the  data  flow  looks  like?   • How  is  it  managed? • How  is  it  working  – today,  yesterday?
  • 20.
    Page   20©  Hortonworks  Inc.  2011  – 2015.  All  Rights  Reserved Using  an  Enterprise  Service  Bus  (ESB) Still,  only  a  subset  agree   using  an  ESB: 1. Protocol 2. Format 3. Schema 4. Priority 5. Size  of  event 6. Frequency  of  event 7. Authorization  access 8. Relevance P1 Broker CN C1 Messaging Even  more  issues  to  consider: • Remote  procedure  calls  (RPC)  and  throughput  issues   are  introduced • Design  and  deploy  management  – slow  setup,  not  interactive • You  can  scale  out,  but  not  up  or  down • You  still  don’t  know  what  the  data  flow  looks  like
  • 21.
    Page   21©  Hortonworks  Inc.  2011  – 2015.  All  Rights  Reserved OS/Host JVM Flow  Controller Web  Server Processor  1 Extension  N FlowFile Repository Content Repository Provenance Repository Local  Storage OS/Host JVM Flow  Controller Web  Server Processor  1 Extension  N FlowFile Repository Content Repository Provenance Repository Local  Storage Architecture OS/Host JVM NiFi  Cluster  Manager  – Request  Replicator Web  Server Master NiFi  Cluster   Manager  (NCM) OS/Host JVM Flow  Controller Web  Server Processor  1 Extension  N FlowFile Repository Content Repository Provenance Repository Local  Storage Slaves NiFi  Nodes High  Availability:  Control  plane  vs Data  plane…
  • 22.
    Page   22©  Hortonworks  Inc.  2011  – 2015.  All  Rights  Reserved Define  A  Hortonworks  DataFlow • Easy  to  use  drag  and  drop  UI • Flexible  to  define  the  Data  Flow
  • 23.
    Page   23©  Hortonworks  Inc.  2011  – 2015.  All  Rights  Reserved HDF  – Powered  by  Apache  NiFi
  • 24.
    Page   24©  Hortonworks  Inc.  2011  – 2015.  All  Rights  Reserved Add  processor  for  data  intake 1 Drag  and  drop  processor  icon  from  the  top  menu
  • 25.
    Page   25©  Hortonworks  Inc.  2011  – 2015.  All  Rights  Reserved Choose  the  specific  processor 2 Choose  one  of  the  processors  – currently  90  available  – designed  for  extension
  • 26.
    Page   26©  Hortonworks  Inc.  2011  – 2015.  All  Rights  Reserved Example:  Pick  Twitter  Processor
  • 27.
    Configure the processor
    3. Select processor and choose option to Configure
    4. Adjust parameters as required
  • 28.
    Another processor for data output
    5. Drag and drop processor icon from the top menu
    6. Example: choose PutHDFS processor
  • 29.
    Configure second processor
    7. Configure 2nd processor
  • 30.
    Connect processors, configure connection (step 8)
  • 31.
    Click Start to begin processing (step 9)
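    Steps 1 to 9 amount to wiring a source processor to a sink processor through a queued connection. The following is a minimal Python sketch of that idea only; it is not NiFi's API. The names `Connection` and `run_flow`, and the `max_queued` bound on the connection, are hypothetical and chosen for illustration.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Optional


class Connection:
    """A bounded FIFO queue between two processors (hypothetical model)."""

    def __init__(self, max_queued: int = 100) -> None:
        self.max_queued = max_queued
        self._queue: Deque[str] = deque()

    def offer(self, item: str) -> bool:
        """Enqueue an item; return False when the queue is already full."""
        if len(self._queue) >= self.max_queued:
            return False
        self._queue.append(item)
        return True

    def poll(self) -> Optional[str]:
        """Dequeue the oldest item, or return None when the queue is empty."""
        return self._queue.popleft() if self._queue else None

    def __len__(self) -> int:
        return len(self._queue)


def run_flow(source: Iterable[str],
             sink: Callable[[str], None],
             connection: Connection) -> int:
    """Move every item from source to sink via the connection.

    Returns the number of items delivered to the sink.
    """
    delivered = 0
    for item in source:
        # If the connection is full, let the sink drain one item first.
        while not connection.offer(item):
            sink(connection.poll())
            delivered += 1
    # The source is exhausted: drain whatever is still queued.
    while len(connection):
        sink(connection.poll())
        delivered += 1
    return delivered
```

    In this sketch a list of strings stands in for the Twitter intake and a plain `list.append` stands in for the PutHDFS output; items arrive at the sink in the order they were produced.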
  • 32.
    See processors update with real time changes
    10. As data flows, the GUI updates in real time.
  • 33.
    Dynamically adjust and tune data flow as needed
    11. Dynamically adjust and tune dataflow as needed, in real time. Can also replicate data for testing and comparison.
  • 34.
    Understand the data path with Data Provenance
    14. Select Data Provenance
  • 35.
    Trace lineage of a particular piece of data
    15. Icon for Data Lineage
  • 36.
    Every change to data is tracked: processing, views
    16. Provenance event is tracked
  • 37.
    Updates as changes happen
    17. Updates as data flows
  • 38.
    Easily access and trace changes to dataflow
  • 39.
    Audit trail of Hortonworks DataFlow User Actions
  • 40.
    NiFi is complementary to Hadoop
    Operations: Deployment flexibility from devices to data center. Delivers data flow QoS across dimensions such as loss tolerant vs. guaranteed delivery, low latency vs. high throughput, and priority-based queuing.
    Governance: Starting at the source, captures fine-grained metadata regarding all data received, forked, joined, cloned, modified, sent, and ultimately dropped as data reaches its configured end-state, delivering comprehensive governance (aka provenance, chain of custody).
    Security: Secures the data movement from beginning to end. Allows fine-grained data authorization policies to be enforced at the flow level.
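    Priority-based queuing, one of the QoS dimensions listed above, can be illustrated with a small heap-backed queue. This is a sketch under stated assumptions, not how NiFi implements its prioritizers: the class name `PriorityConnection` and the convention that a lower number means higher priority are both hypothetical.

```python
import heapq
import itertools
from typing import Any, List, Tuple


class PriorityConnection:
    """Queue that releases the highest-priority item first (hypothetical model).

    Assumption: a lower priority number means "more urgent". Items with equal
    priority are released in arrival order.
    """

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        # Monotonic counter used as a tie-breaker so equal priorities stay FIFO.
        self._arrival = itertools.count()

    def put(self, item: Any, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._arrival), item))

    def get(self) -> Any:
        if not self._heap:
            raise IndexError("queue is empty")
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

    With this model, an item queued late with priority 1 is released ahead of earlier items queued with priority 5, while items that share a priority keep their arrival order.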
  • 41.
    Operations
    • Reporting tasks (push)
    • Statistics / status (pull)
    • Dynamic flow changes
      - Push new business rules via REST API (closed loop)
      - Pull updates periodically from web services
    • Site-to-site
      - Stay at the ‘flow level’ rather than suddenly dropping down to file transfer protocols
    • Extensible
    • Optimized user experience: log hunts should be the exception
    Scale down, up, and out, in containers and on virtual machines
  • 42.
    The Need for Data Provenance
    For Operators
    • Traceability, lineage
    • Recovery and replay
    For Compliance
    • Audit trail
    For Business
    • Value sources
    • Value IT investment
    (Diagram labels: BEGIN, END, LINEAGE)
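    Traceability and lineage can be pictured as walking a log of provenance events back from one piece of data to the data it was derived from. The sketch below is illustrative only and does not use NiFi's provenance repository or API; `ProvenanceEvent`, its fields, and the `lineage` function are hypothetical names, and the event type strings are just labels.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ProvenanceEvent:
    """One recorded event about a piece of data (hypothetical model)."""
    event_type: str                  # label such as "RECEIVE", "CLONE", "SEND"
    flowfile_id: str                 # the piece of data the event is about
    parent_id: Optional[str] = None  # set when this data was derived from another


def lineage(events: List[ProvenanceEvent], flowfile_id: str) -> List[ProvenanceEvent]:
    """Return the events for flowfile_id and all of its ancestors.

    Events are grouped oldest ancestor first; within one piece of data they
    keep the order in which they were logged.
    """
    # Map each derived item to the item it came from.
    parent_of: Dict[str, str] = {
        e.flowfile_id: e.parent_id for e in events if e.parent_id is not None
    }
    # Walk from the requested item up to its root ancestor (guarding against cycles).
    chain = [flowfile_id]
    while chain[-1] in parent_of and parent_of[chain[-1]] not in chain:
        chain.append(parent_of[chain[-1]])
    chain.reverse()
    position = {fid: i for i, fid in enumerate(chain)}
    selected = [e for e in events if e.flowfile_id in position]
    # sorted() is stable, so log order is preserved within each item.
    return sorted(selected, key=lambda e: position[e.flowfile_id])
```

    An operator asking "where did this item come from?" would get the ancestor's events followed by the item's own, which is also the information a replay would need.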
  • 43.
    Extending Data Governance from the Edge to Hadoop
    Data Governance Requirements
    • Transparent: Governance standards and protocols must be clearly defined and available to all
    • Reproducible: Recreate the relevant data landscape at a given point in time
    • Auditable: Trace all relevant events and assets with appropriate historical lineage
    • Consistent: Compliance practices must be consistent
    Must snap into existing data governance frameworks and openly exchange metadata
    (Diagram labels: Internet of Anything; Traditional Data Systems (SCM, CRM, ERP); ETL / DQ; MDM; ARCHIVE; Hadoop Data Platform; Holistic Data Governance; Business Analytics; Visualization & Dashboards)
  • 44.
    The Need for Fine-grained Security and Compliance
    It’s not enough to say you have encrypted communications
    • Enterprise authorization services: entitlements change often
    • People and systems with different roles require different access levels
    • Tagged/classified data
  • 45.
    Security
    Administration: Central management and consistent security
    • NiFi Cluster Manager
    Authentication: Authenticate users and systems
    • 2-Way SSL support out of the box; additional types coming
    Authorization: Provision access to data
    • Pluggable authorization designed to fit any Identity and Access Management (IAM) scheme
    • File-based authority provider out of the box
    • Multi-role
    Audit: Maintain a record of data access
    • Detailed logging of all user actions
    • Detailed logging of key system behaviors
    • Data Provenance enables unparalleled tracking from the edge through the Lake
    Data Protection: Protect data at rest and in motion
    • Support a variety of SSL/encrypted protocols
    • Tag and utilize tags on data for fine-grained access controls
    • Encrypt/decrypt content using pre-shared key mechanisms
    Roles
    • Administrator: Configure system threads, user accounts, and flow audit history
    • Data Flow Manager: Manipulate the dataflow
    • Read Only: View the dataflow only
    • NiFi: Configure system threads, user accounts, and flow audit history
    • Proxy: Manipulate the dataflow
    • Provenance: Query the provenance repository and download content
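    The multi-role authorization model on this slide can be sketched as a lookup from roles to permitted actions. This is an illustration, not NiFi's authority provider: the `Action` enum, the `ROLE_ACTIONS` table, and `is_allowed` are hypothetical, and the table covers only four of the roles, using the descriptions given on the slide. Whether one role implies another (for example, whether a Data Flow Manager may also view the flow) is not stated in the source, so no such implication is modeled.

```python
from enum import Enum, auto
from typing import Dict, FrozenSet, Iterable


class Action(Enum):
    ADMINISTER = auto()        # configure system threads, user accounts, audit history
    MODIFY_FLOW = auto()       # manipulate the dataflow
    VIEW_FLOW = auto()         # view the dataflow only
    QUERY_PROVENANCE = auto()  # query the provenance repository, download content


# Hypothetical role table derived from the slide's role descriptions.
ROLE_ACTIONS: Dict[str, FrozenSet[Action]] = {
    "Administrator": frozenset({Action.ADMINISTER}),
    "Data Flow Manager": frozenset({Action.MODIFY_FLOW}),
    "Read Only": frozenset({Action.VIEW_FLOW}),
    "Provenance": frozenset({Action.QUERY_PROVENANCE}),
}


def is_allowed(roles: Iterable[str], action: Action) -> bool:
    """Return True if any of the user's roles grants the action.

    Unknown role names grant nothing.
    """
    return any(action in ROLE_ACTIONS.get(role, frozenset()) for role in roles)
```

    Because a user may hold several roles, access is the union of what each role grants; a user with both "Read Only" and "Provenance" can view the flow and query provenance but cannot modify the flow.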
  • 47.
    Operations: Planned
  • 50.
    Planned Apache NiFi Enhancements
    • IN PROGRESS: Enhanced configuration management of flows
    • STARTED: Extension and template registry
    • TARGETED TO NIFI 0.4.0 RELEASE: First-class Avro support
    • STARTED: Interactive queue management
    • STARTED: Multi-tenant data flow
    • FUTURE: Pluggable authentication
    • FUTURE: Reference-able process groups
    • FUTURE: Variable registry
    https://cwiki.apache.org/confluence/display/NIFI/NiFi+Feature+Proposals
  • 51.
    Try It Yourself
    Download NiFi and HDP Sandbox from hortonworks.com/sandbox
    Tweet: #hadooproadshow
  • 52.
    Thank you!
    Mats Johansson
    mjohansson@hortonworks.com
    @matsjo66
    https://se.linkedin.com/in/matsjo66