Hive is an all-in-one project management tool developed to “help teams move faster” regardless of how they work. Features are created based on users’ requests and are updated weekly, making Hive the world’s first democratic software platform. It’s best known for its capabilities in project management, time management, team collaboration, automation, and an array of integrations with third-party software. Hive is free to use for solo users and with premium versions available to teams and enterprises.
Capabilities |
|
---|---|
Segment |
|
Deployment | Cloud / SaaS / Web-Based, Mobile Android, Mobile iPad, Mobile iPhone |
Support | 24/7 (Live rep), Chat, Email/Help Desk, FAQs/Forum, Knowledge Base, Phone Support |
Training | Documentation |
Languages | English |
Ease of use as well as ability to scale. It has proven its reliability. They have continued to add more features and increased its speed at the same time.
Speed is still slower compared to newer distributed warehouses. Also, it still uses mapreduce behind the scene which is very slow in the present days.
Storing large amount of data that could not fit in to any relational database system. Being able to derive valuable insight into our data by running mapreduce jobs on data stored in Hive.
To be able to run map reduce jobs using json parsing and generate dynamic partitions in parquet file format.
It is slow compared to Spark/Impala for most operations. Also, it throws Out of Memory if multiple partitions are updated containing many parquet files.
Events are gathered in HDFS by flume and needs to be processed into parquet files for fast querying. The input data contains variable attributes in the json payload as each customer could define custom attributes. It is part of the ETL pipeline, where hive jobs read json data and generates parquet files that would be queried using impala/spark. Using views, each customer queries only the relevant data.
-> Easy to configure/create a table for Big data or Streaming data -> Fast and easy to Query. Business/non-technical folks can use HUE for more interactive Querying on Hive tables -> HUE can have saved results and old queries along with exporting results in Excell.CSV
-> Joining and parsing multiple tables with huge sizes still remain a challenge. -> Some of SQL operations doesn't work in hive like non equality joins,
Dumping Site activity Big Data streaming data as well as data logs in Hive
Open source framework allows to read and write and manage tha data like sql , HQl which makes it easy to use.
The latency in Apache hive is very high.
It provides SQL like query language called HQl with schema on read and transparently convert queries to map reduce.
I like the most in Apache Hive supports partitioning and bucketing for fast data retrieval. We can create custom UDF as per the requirements to perform data cleansing and filtering. It supports HQL similar to SQL which gives easy for the people who comes from SQL background.
Doesn't support OLTP and also doesn't support delete or update actions.
We have created a semantic layer in Hive that helps us to process the terabytes of data and generate the reports faster. it also helped us fault tolerance and high availability of the data
Flexible and easy to understand loved it
Not compatible with multiple platforms hence mostly plotform depend
Works best with ETL related works or tasks
It is easy to run query in hive as hive uses hql which is very similar to sql. Hive has hivemetastore service to save the metadata and hiveserver2 to serve the client requests so the segregation here helps in proper resource distribution. Hive is also fault tolerant which makes it ideal to run ETL long running batches
Hive has a problem of cold start and since it used mapreduce algorithm at the backend, it is way slower than spark which made us move to spark from hive as the job completion time after switching to spark got reduced by 70-80%
Informatica data ingestion Abinitio data ingestion and modifications Data formatting (as it provides option such as csv,parauet etc) Data transformation using hive query Data pipelines
The friendliness of the data warehouse tool for the database developers
Not inclusion of acid properties, it doesn't have the acid properties as in the databases
I usually use hive for my big data [data migration problems], the speed at which the query operates, and the option to choose various engines
If you know how to write sql statements you can write hiveql and it doesn't require you to learn anything new,its pretty straightforward
Performance tuning is difficult and becomes hard for complex queries, it still has a few bugs like all the data going to single reducer, which might lead to slow down the query results.
For developing reports for business analysts, lot of them know sql statements so its easy to write and pull information for analysis
- Easy to learn - Can query complex data including nested structures. - Flexible (wrt data schema) - With ORC SerDe, I/O can be reduced drastically - by reading only what is required (columnar formats).
- Needs schema to be defined in prior. - Not ANSI SQL compliant. - Not suitable for fast interactive queries, even on moderate size datasets. - Works only with Hadoop (not an independent query-processing tool) - Not enterprise grade w.r.t quality of documentation, error messages, support
Exploring ways to store and process semantic datasets
Hive its a data warehousing infrastructure built on top of Hadoop to provide data grouping, querying, and analysis.Apache Hive soporta el análisis de grandes conjuntos de datos almacenados bajo HDFS de Hadoop y en sistemas compatibles como el sistema de archivos Amazon.It offers a SQL-based query language called HiveQL5 with schemas to transparently read and convert queries in MapReduce, Apache Tez6, and Spark tasks. All three execution engines can run under YARN. To speed up queries, Hive provides indexes, which include bitmap indexes.
Offers many tools, has great growth potential
Possibility of storing metadata in an organized and easily accessible way.
Hive syntax is almost like sql, so for someone already familiar with sql it takes almost no effort to pick up hive. But there are other tools that can do the same thing faster these days. Hive initially was really good to have; but more and more projects are now available to do SQL like operations on Big Data (like Drill).
Hive is comparatively slower than its competitors. Its easy to use but that comes with the cost of processing, If you are using it just for batch processing then hive is well and fine. It also does not have as rich of a scripting language.
In Retail, the business partners are more comfortable querying their own data instead of relying on Engineers. Hive solves one of that problems. The main purpouse of using Hive is to building reports and do analysis of data that is stored in the Hadoop file system.
nothing in particular. helps us with big data and allows all users to have unrestricted bandwidth, but we already ran into issues with that, so now one of the servers has limitations.
. at my company it was fairly troublesome getting access since it's underlying warehouseing is in hadoop, then have to connect through hive
data insights with big browser data through mapreduce
Easy SQL like syntax for very short and simple queries
No alias for relation. No flow controls as well.
I build machine learning model for online advertising system. Hive to me is more like a ad-hoc query engine rather than a platform where I can develop complex algorithm on
Schema on any format HDFS files. Easy to download the data. A complete tool similar to database tool like toad.
Performance,sometime it is very difficult to run queries. Gui can be improved with more user friendly options
Data processing for regulatory reporting ...maintain lineage
I was a top fan of Impala for a while until I reached a series of limitations that were impossible to overcome. I work a lot with arrays and just the fact of being able to use array_contains in impala made me switch to Hive. Also, we are moving fast on the direction of self made Macross for hive that let us do complex queries without lateral view explodes
Session creation takes a while and speed is quite slow when comparing to Impala
Complex data analysis with tables that have several billion rows by partition
It is highly flexible in configurations. So many options to load data from- directly from linux file system or hdfs. You can create external and managed tables. One fun feature is that you can shoot bash commands from hive as well
It cannot be used for streaming data. Error logging can be improved so that error tracking and resolution can be more efficient.
It is used to transform and process Big Data datasets in batches. It can handle TBs of data. Push predicate feature has greatly improved the performance of the queries and the developer doesn't need to think about it anymore
Hive is great for handling logs in big data projects. We are using the same in our project and it is great for using joins and grouping which is very difficult and tricky in map reduce. It has a lot of udf packages and it is very easy to add new udfs. We were also using bucketing and clustering to optimize the query. Concept of external tables and the way we can manipulate data even when table is deleted from hive is really amazing. Lot of connectors available in the market for different softwares.
The thing which I dislike is latency and the way it saves data. While inserting data I have to wait a lot of few records. Compiler execution plan is very immature as it does not do proper query optimization. Though the community is working fast for overcoming quickly but I think it will take time for hive to be
We are using hive mainly for saving our logs. it helps us to keep track of what records are inserted, which records have failed and what are relationship between them. we are using tableau for analyzing data .
The progression of features, speed, etc brings me the strategic confidence I need in the SQL in hadoop space.
At this point, everything is on pint & theories it is great in hive 1.2
Deriving value from masses of unstructured & structured data.
The ability to view HDFS data in a relational format and easily query it through HiveQL
The fact that it uses MapReduce whether you query a pre existing table or a perform a complex query. Tez helps with this issue. Also the inability to delete/update data is a real issue and forces other services to be used eg HBase.
The ability to use Hive on HUE is perfect. We are building a platform for data scientists (prefer GUI to shell) to perform analysis so removing the need for command line is excellent.