== VMware: Apache Hadoop and Virtual Machines
The idea is to run multiple virtual machines on the cluster's physical hosts and place the Hadoop nodes inside those VMs.
You can grow and shrink a single VM to accommodate the load:
Adding and removing whole nodes is hard, though, because their state is lost - this is the problem.
The big idea is to keep HDFS on the physical hosts and run the MapReduce nodes in VMs - that way no state is lost, and task nodes can be added and removed at will.
== Analytical Queries with Hive: SQL Windowing and Table Functions
An amazingly uninteresting session: the presenter never stated the high-level problem and went too deep into internal implementation details. Or maybe I am just not qualified enough to understand it. Anyway, here are a couple of screenshots:
== Starving for Juice (intermission)
As I mentioned before, there are only a few places where you can charge your laptop. And the MacBook is always hungry...
== Creating Histograms from a Data Stream via MapReduce
The problem: the data arrives as a stream, so it's hard to get the min and max up front, and the data is distributed.
Solution: Partition Incremental Discretization (PiD), which produces an approximate histogram.
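To make the idea concrete, here is a minimal sketch of the incremental layer of PiD as I understood it: keep a set of bins, update them one value at a time, grow the range when a value falls outside the initial min/max guess, and split any bin that accumulates too large a share of the data. The class name, the split threshold `alpha`, and the details are my own assumptions, not the presenter's code.

```python
import bisect

class PiDHistogram:
    """Sketch of the incremental layer of Partition Incremental
    Discretization (PiD). Hypothetical implementation: parameter
    names and the split rule are assumptions, not from the talk."""

    def __init__(self, lo, hi, nbins=10, alpha=0.25):
        step = (hi - lo) / nbins
        self.edges = [lo + i * step for i in range(nbins + 1)]
        self.counts = [0] * nbins
        self.total = 0
        self.alpha = alpha  # max fraction of the data one bin may hold

    def update(self, x):
        # Grow the range if the stream exceeds the initial min/max guess.
        if x < self.edges[0]:
            self.edges.insert(0, x)
            self.counts.insert(0, 0)
        elif x > self.edges[-1]:
            self.edges.append(x)
            self.counts.append(0)
        # Locate the bin for x and count it.
        i = bisect.bisect_right(self.edges, x) - 1
        i = min(i, len(self.counts) - 1)
        self.counts[i] += 1
        self.total += 1
        # Split a bin that collected too large a share of the mass.
        if self.counts[i] > self.alpha * self.total and self.total > len(self.counts):
            mid = (self.edges[i] + self.edges[i + 1]) / 2
            half = self.counts[i] // 2
            self.edges.insert(i + 1, mid)
            self.counts[i] -= half
            self.counts.insert(i + 1, half)
```

Because each `update` touches only one bin (plus an occasional split), per-node partial histograms like this can be merged in a reduce step, which is what makes the approach fit MapReduce.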
== Hadoop Best Practices in the Cloud
The presenter previously worked at Amazon, so he presented AWS.
An Elastic MapReduce job runs on one or more clusters.
A cluster of 1000 MapReduce instances can be provisioned in 5 minutes.
You can have lots of clusters.
Instead of HDFS, the data is read from S3. S3 is more durable: 11 nines of durability and 99.9% availability.
HDFS is still faster for scanning, though, so it's better to use HDFS as a cache for temporary data while keeping the permanent data in S3.
S3 can also be used for migrating HDFS clusters from an older version to a newer one.
Steps and Bootstrap actions
Steps are MapReduce jobs, or Hive or Pig scripts. They are stored in S3.
Bootstrap actions run at the beginning of each workflow. They are also stored in S3.
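As a rough illustration of how steps and bootstrap actions are wired together, here is what a cluster launch might look like with the current AWS CLI. All the names here (bucket, scripts, instance type) are made up, and the talk predates this exact syntax, so treat it as a sketch rather than what the presenter showed:

```shell
# Hypothetical EMR launch: one bootstrap action and one Hive step,
# both pulled from S3 (bucket/key names are invented for illustration).
aws emr create-cluster \
  --name "nightly-report" \
  --release-label emr-5.36.0 \
  --applications Name=Hadoop Name=Hive \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --bootstrap-actions Path=s3://my-bucket/bootstrap/install-deps.sh \
  --steps Type=HIVE,Name="report",Args=[-f,s3://my-bucket/steps/report.hql] \
  --auto-terminate
```

The bootstrap action runs on every node before Hadoop starts; the step runs once the cluster is up, and `--auto-terminate` shuts the cluster down when the steps finish.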
Complex workflows can be written in scripting languages like Python. Alternatively, the AWS Flow Framework can be used (it seems to be something like Cascading).
There are development, testing, and production stacks: you can develop your Hadoop programs in isolation and point them at different S3 storages.
The "validating computations" technique: do the computation twice using different logic to test the correctness of the main logic.
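A tiny sketch of the idea, using a mean as the statistic (my own example, not the presenter's): two independent implementations, one direct and one single-pass as a streaming job would compute it, and a check that they agree.

```python
# "Validating computations": compute the same statistic via two
# independent code paths; a mismatch flags a bug in one of them.
# Function names and the tolerance are my own choices.

def mean_direct(xs):
    # Straightforward definition: sum divided by count.
    return sum(xs) / len(xs)

def mean_streaming(xs):
    # Independent second implementation: a running (Welford-style) mean,
    # closer to how a single-pass job over a stream would compute it.
    m = 0.0
    for i, x in enumerate(xs, 1):
        m += (x - m) / i
    return m

def validate(xs, tol=1e-9):
    # The cross-check: both code paths must agree within tolerance.
    return abs(mean_direct(xs) - mean_streaming(xs)) < tol
```

The point is that the two paths share no code, so a logic error in the main pipeline is unlikely to be reproduced by the validator.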
Performance tuning is achieved surprisingly by switching from one instance type to another with different memory and disk capacity:
There are different models of payment for computing power:
- on demand
- spot market
- reserved instances
== Hadoop in Education
MOOCs (massive open online courses)
In Hadoop we can store:
- Student logs
- Student Assessments
- Faculty interaction logs
- Student/Faculty interaction logs
- Processed Content Usage
- Those logs must be provided without much developer effort
- How to feed this data to Hive
- How to sessionize this data
- Joins of big data
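On the sessionization question, the usual approach is to sort each user's events by time and cut a new session whenever the gap between consecutive events exceeds an inactivity threshold. A small single-machine sketch of that logic (the 30-minute gap and all names are my assumptions; at Hadoop scale this would be a Hive query or a MapReduce job partitioned by user):

```python
from datetime import timedelta
from itertools import groupby

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity threshold

def sessionize(events):
    """events: iterable of (user_id, timestamp) pairs.
    Returns a dict: user_id -> list of sessions, each session
    being the list of that user's timestamps in time order."""
    sessions = {}
    # Sort by user, then by time, so groupby sees each user contiguously.
    events = sorted(events)
    for user, group in groupby(events, key=lambda e: e[0]):
        user_sessions = []
        current = []
        for _, ts in group:
            # A gap longer than the threshold starts a new session.
            if current and ts - current[-1] > SESSION_GAP:
                user_sessions.append(current)
                current = []
            current.append(ts)
        user_sessions.append(current)
        sessions[user] = user_sessions
    return sessions
```

The same sort-then-scan shape maps directly onto MapReduce: partition by user in the shuffle, sort by timestamp, and run the gap-cutting scan in the reducer.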