When copying text from a website to your device's clipboard, there's a good chance that you will get the formatted HTML when pasting it. Some apps and operating systems have a "Paste Special" feature that will strip those tags out for you to maintain the current style, but what do you do if that's unavailable?
The same goes for converting plain text into formatted HTML. One of the closest ways we can convert plain text into HTML is writing in Markdown as an abstraction. You may have seen examples of this in many comment forms in articles just like this one. Write the comment in Markdown and it is parsed as HTML.
Even better would be no abstraction at all! You may have also seen (and used) a number of online tools that take plainly written text and convert it into formatted HTML. The UI makes the conversion and previews the formatted result in real time.
Providing a way for users to author basic web content, like comments, without knowing even the first thing about HTML is a noble pursuit, as it lowers barriers to communicating and collaborating on the web. Saying it helps "democratize" the web may be heavy-handed, but it doesn't conflict with that vision!
We can build a tool like this ourselves. I'm all for using existing resources where possible, but I'm also for demonstrating how these things work and maybe learning something new in the process.
Defining The Scope
There are plenty of assumptions and considerations that could go into a plain-text-to-HTML converter. For example, should we assume that the first line of text entered into the tool is a title that needs corresponding <h1> tags? Is each new line truly a paragraph, and how does linking content fit into this?
Again, the idea is that a user should be able to write without knowing Markdown or HTML syntax. This is a big constraint, and there are far too many HTML elements we might encounter, so it's worth knowing the context in which the content is being used. For example, if this is a tool for writing blog posts, then we can limit the scope of which elements are supported based on those that are commonly used in long-form content: <h1>, <p>, <a>, and <img>. In other words, it will be possible to include top-level headings, body text, linked text, and images. There will be no support for bulleted or ordered lists, tables, or any other elements for this particular tool.
The front-end implementation will rely on vanilla HTML, CSS, and JavaScript to establish a small form with a simple layout and functionality that converts the text to HTML. There is a server-side aspect to this if you plan on deploying it to a production environment, but our focus is purely on the front end.
Looking At Existing Solutions
There are existing ways to accomplish this. For example, some libraries offer a WYSIWYG editor. Import a library like TinyMCE with a single <script> and you're good to go. WYSIWYG editors are powerful and support all kinds of formatting, even applying CSS classes to content for styling.
But TinyMCE isn't the most efficient package at about 500 KB minified. That's not a criticism as much as an indication of how much functionality it covers. We want something more "barebones" than that for our simple purpose. Searching GitHub surfaces more possibilities. The solutions, however, seem to fall into one of two categories:
The input accepts plain text, but the generated HTML only supports the HTML <h1> and <p> tags.
The input converts plain text into formatted HTML, but by "plain text," the tool seems to mean "Markdown" (or a variety of it) instead. The txt2html Perl module (from 1994!) would fall under this category.
Even if a perfect solution for what we want was already out there, I'd still want to pick apart the concept of converting text to HTML to understand how it works and hopefully learn something new in the process. So, let's proceed with our own homespun solution.
Setting Up The HTML
We'll start with the HTML structure for the input and output. For the input element, we're probably best off using a <textarea>. For the output element and related styling, choices abound. The following is merely one example with some very basic CSS to place the input <textarea> on the left and an output <div> on the right:
See the Pen Base Form Styles [forked] by Geoff Graham.
You can further develop the CSS, but that isn't the focus of this article. There is no question that the design can be prettier than what I'm providing here!
Capture The Plain Text Input
We'll set an onkeyup event handler on the <textarea> to call a JavaScript function called convert() that does what it says: convert the plain text into HTML. The conversion function should accept one parameter, a string, for the user's plain text input entered into the <textarea> element:
<textarea onkeyup='convert(this.value);'></textarea>
onkeyup is a better choice than onkeydown in this case, as onkeyup will call the conversion function after the user completes each keystroke, as opposed to before it happens. This way, the output, which is refreshed with each keystroke, always includes the latest typed character. If the conversion is triggered with an onkeydown handler, the output will exclude the most recent character the user typed. This can be frustrating when, for example, the user has finished typing a sentence but cannot yet see the final punctuation mark, say a period (.), in the output until typing another character first. This creates the impression of a typo, glitch, or lag when there is none.
In JavaScript, the convert() function has the following responsibilities:
Encode the input in HTML.
Process the input line-by-line and wrap each individual line in either a <h1> or <p> HTML tag, whichever is most appropriate.
Process the output of the transformations as a single string, wrap URLs in HTML <a> tags, and replace image file names with <img> elements.
And from there, we display the output. We can create separate functions for each responsibility. Let's name them accordingly:
html_encode()
convert_text_to_HTML()
convert_images_and_links_to_HTML()
Each function accepts one parameter, a string, and returns a string.
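Because every step takes a string and returns a string, the steps can be chained in any order we choose. The following is only a sketch of that idea: run_pipeline() and the two stand-in steps are hypothetical names invented for illustration, not part of the tool we are building.

```javascript
// Hypothetical helper (not from the article): feeds the output of each
// string-to-string step into the next one, left to right.
function run_pipeline(input, steps) {
  return steps.reduce((text, step) => step(text), input);
}

// Stand-in steps for illustration only. In the real tool, the steps would be
// html_encode(), convert_text_to_HTML(), and convert_images_and_links_to_HTML().
const strip_carriage_returns = (text) => text.replace(/\r/g, "");
const trim_outer_whitespace = (text) => text.trim();

const result = run_pipeline("  Hello\r\nworld  ", [
  strip_carriage_returns,
  trim_outer_whitespace,
]);

console.log(JSON.stringify(result)); // "Hello\nworld"
```

Keeping one input type and one output type for every step is what makes this kind of chaining possible without any glue code.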
Encoding The Input Into HTML
Use the html_encode() function to HTML encode/sanitize the input. HTML encoding refers to the process of escaping or replacing certain characters in a string input to prevent users from inserting their own HTML into the output. At a minimum, we should replace the following characters:
< with &lt;
> with &gt;
& with &amp;
' with &#39;
" with &quot;
JavaScript does not provide a built-in way to HTML encode input as other languages do. For example, PHP has htmlspecialchars(), htmlentities(), and strip_tags() functions. That said, it is relatively easy to write our own function that does this, which is what we'll use the html_encode() function we defined earlier for:
function html_encode(input) {
  // Let the browser do the escaping by round-tripping through a <textarea>
  const textArea = document.createElement("textarea");
  textArea.innerText = input;
  // innerText turns line breaks into <br> elements, so restore them as newlines
  return textArea.innerHTML.split("<br>").join("\n");
}
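For environments without a DOM, the character list above can also be applied directly with string replacement. This is a minimal sketch rather than the approach used in this article; html_encode_with_map() and HTML_ENTITIES are hypothetical names chosen for illustration.

```javascript
// Hypothetical lookup table built from the character list above.
const HTML_ENTITIES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  "'": "&#39;",
  '"': "&quot;",
};

// Hypothetical alternative to html_encode() that does not need a DOM.
// Everything is replaced in a single pass, so an "&" produced by one
// replacement is never encoded again within the same call.
function html_encode_with_map(input) {
  return String(input).replace(/[&<>'"]/g, (character) => HTML_ENTITIES[character]);
}

console.log(html_encode_with_map(`<b>Tom & "Jerry"</b>`));
// &lt;b&gt;Tom &amp; &quot;Jerry&quot;&lt;/b&gt;

// Encoding text that is already encoded mangles it:
console.log(html_encode_with_map(html_encode_with_map("<")));
// &amp;lt;
```

The second call shows why input should be encoded exactly once: the browser would render the doubly encoded result as the literal text &lt; instead of <.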
HTML encoding of the input is a critical security consideration. It prevents unwanted scripts or other HTML manipulations from getting injected into our work. Granted, front-end input sanitization and validation are both merely deterrents because bad actors can bypass them. But we may as well make them work a little harder.
As long as we are on the topic of securing our work, make sure to HTML-encode the input on the back end, where the user cannot interfere. At the same time, take care not to encode the input more than once. Encoding text that is already HTML-encoded will break the output functionality. The best approach for back-end storage is for the front end to pass the raw, unencoded input to the back end, then ask the back end to HTML-encode the input before inserting it into a database.
That said, this only accounts for sanitizing and storing the input on the back end. We still have to display the encoded HTML output on the front end. There are at least two approaches to consider:
Convert the input to HTML after HTML-encoding it and before it is inserted into a database.
This is efficient, as the input only needs to be converted once. However, this is also an inflexible approach, as updating the HTML becomes difficult if the output requirements happen to change in the future.
Store only the HTML-encoded input text in the database and dynamically convert it to HTML before displaying the output for each content request.
This is less efficient, as the conversion will occur on each request. However, it is also more flexible since it's possible to update how the input text is converted to HTML if requirements change.
Applying Semantic HTML Tags
Let's use the convert_text_to_HTML() function we defined earlier to wrap each line in its respective HTML tag, which is going to be either <h1> or <p>. To determine which tag to use, we will split the text input on the newline character (\n) so that the text is processed as an array of lines rather than a single string, allowing us to evaluate them individually.
function convert_text_to_HTML(txt) {
  // Output variable
  let out = '';
  // Split text at the newline character into an array
  const txt_array = txt.split("\n");
  // Get the number of lines in the array
  const txt_array_length = txt_array.length;
  // Variable to keep track of the (non-blank) line number
  let non_blank_line_count = 0;
  for (let i = 0; i < txt_array_length; i++) {
    // Get the current line
    const line = txt_array[i];
    // Continue if a line contains no text characters
    if (line === '') {
      continue;
    }
    non_blank_line_count++;
    // If a line is the first line that contains text
    if (non_blank_line_count === 1) {
      // ...wrap the line of text in a Heading 1 tag
      out += `<h1>${line}</h1>`;
    // ...otherwise, wrap the line of text in a Paragraph tag.
    } else {
      out += `<p>${line}</p>`;
    }
  }
  return out;
}
In short, this little snippet loops through the array of split text lines and ignores lines that do not contain any text characters. From there, we can evaluate whether a line is the first one in the series. If it is, we slap a <h1> tag on it; otherwise, we mark it up in a <p> tag.
This logic could be used to account for other types of elements that you may want to include in the output. For example, perhaps the second line is assumed to be a byline that names the author and links up to an archive of all author posts.
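As a rough sketch of that byline idea, the line-by-line logic could be extended as follows. The function name convert_text_with_byline() and the choice of an <address> element for the second non-blank line are assumptions made for illustration, not part of the tool built in this article.

```javascript
// Hypothetical variant of convert_text_to_HTML(): the first non-blank line
// becomes a heading, the second a byline, and every later line a paragraph.
// Wrapping the byline in <address> is an assumption for this sketch.
function convert_text_with_byline(txt) {
  return txt
    .split("\n")
    // Same rule as before: skip lines that contain no characters at all
    .filter((line) => line !== "")
    .map((line, index) => {
      if (index === 0) return `<h1>${line}</h1>`;
      if (index === 1) return `<address>${line}</address>`;
      return `<p>${line}</p>`;
    })
    .join("");
}

console.log(convert_text_with_byline("My Title\n\nBy Jane Doe\nFirst paragraph."));
// <h1>My Title</h1><address>By Jane Doe</address><p>First paragraph.</p>
```

Filtering out blank lines first means the array index doubles as the non-blank line counter, so no separate counter variable is needed.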
Tagging URLs And Images With Regular Expressions
Next, we're going to create our convert_images_and_links_to_HTML() function to encode URLs and images as HTML elements. It's a good chunk of code, so I'll drop it in and we'll immediately start picking it apart together to explain how it all works.
function convert_images_and_links_to_HTML(string){
  let urls_unique = [];
  let images_unique = [];
  // Find every URL and every image file name; fall back to an empty array when nothing matches
  const urls = string.match(/https*:\/\/[^\s<),]+[^\s<),.]/gmi) ?? [];
  const imgs = string.match(/[^"'>\s]+\.(jpg|jpeg|gif|png|webp)/gmi) ?? [];
  const urls_length = urls.length;
  const images_length = imgs.length;
  // Keep only the unique URLs
  for (let i = 0; i < urls_length; i++){
    const url = urls[i];
    if (!urls_unique.includes(url)){
      urls_unique.push(url);
    }
  }
  // Keep only the unique image file names
  for (let i = 0; i < images_length; i++){
    const img = imgs[i];
    if (!images_unique.includes(img)){
      images_unique.push(img);
    }
  }
  const urls_unique_length = urls_unique.length;
  const images_unique_length = images_unique.length;
  // Wrap each URL in an <a> tag, unless the URL was also matched as an image
  for (let i = 0; i < urls_unique_length; i++){
    const url = urls_unique[i];
    if (images_unique_length === 0 || !images_unique.includes(url)){
      const a_tag = `<a href="${url}" target="_blank">${url}</a>`;
      string = string.replace(url, a_tag);
    }
  }
  // Replace each image file name with a linked <img> tag
  for (let i = 0; i < images_unique_length; i++){
    const img = images_unique[i];
    const img_tag = `<img src="${img}" alt="">`;
    const img_link = `<a href="${img}">${img_tag}</a>`;
    string = string.replace(img, img_link);
  }
  return string;
}
Unlike the convert_text_to_HTML() function, here we use regular expressions to identify the terms that need to be wrapped and/or replaced with <a> or <img> tags. We do this for a couple of reasons:
The previous convert_text_to_HTML() function handles text that would be transformed to the HTML block-level elements <h1> and <p>, and, if you want, other block-level elements such as <address>. Block-level elements in the HTML output correspond to discrete lines of text in the input, which you can think of as paragraphs, the text entered between presses of the Enter key.
On the other hand, URLs in the text input are often included in the middle of a sentence rather than on a separate line. Images that occur in the input text are often included on a separate line, but not always. While you could identify text that represents URLs and images by processing the input line-by-line (or even word-by-word, if necessary), it is easier to use regular expressions and process the entire input as a single string rather than by individual lines.
Regular expressions, though they are powerful and the appropriate tool to use for this job, come with a performance cost, which is another reason to use each expression only once for the entire text input.
Remember: All the JavaScript in this example runs each time the user types a character, so it is important to keep things as lightweight and efficient as possible.
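To make the single-string approach concrete, here is a small sketch that runs the URL pattern once over a whole sentence. The sample text is made up for illustration; the pattern is the one used in convert_images_and_links_to_HTML().

```javascript
// The URL pattern from convert_images_and_links_to_HTML()
const url_pattern = /https*:\/\/[^\s<),]+[^\s<),.]/gmi;

// Made-up sample input with URLs in the middle and at the end of a sentence
const sample_text = "Visit https://example.com/page, then https://example.org.";

// One call finds every URL in the input, wherever it appears
const matched_urls = sample_text.match(url_pattern) ?? [];

console.log(matched_urls);
// [ 'https://example.com/page', 'https://example.org' ]
```

Note how the final character class in the pattern keeps the trailing comma and period out of the matched URLs.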
I also want to make a note about the variable names in our convert_images_and_links_to_HTML() function. images (plural), image (singular), and link appear on the W3Schools list of words to avoid in JavaScript because they clash with the names of HTML and window objects and properties. As a result, imgs, img, and a_tag were used for naming. Interestingly, these specific words are not listed on the related MDN page, which covers only the reserved words of the language itself.
We're using the String.prototype.match() function for each of the two regular expressions, then storing the results for each call in an array. From there, we use the nullish coalescing operator (??) on each call so that, if no matches are found, the result will be an empty array. If we do not do this and no matches are found, the result of each match() call will be null and will cause problems downstream.
const imgs = string.match(/[^"'>\s]+\.(jpg|jpeg|gif|png|webp)/gmi) ?? [];
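The difference the ?? operator makes is easy to see in isolation. The sample strings in this sketch are made up; the pattern is the image pattern shown above.

```javascript
// The image pattern from convert_images_and_links_to_HTML()
const image_pattern = /[^"'>\s]+\.(jpg|jpeg|gif|png|webp)/gmi;

// Without a fallback, match() returns null when nothing is found,
// so reading .length on the result would throw a TypeError.
const raw_result = "No pictures here".match(image_pattern);
console.log(raw_result); // null

// With the nullish coalescing operator, we always get an array back.
const safe_result = "No pictures here".match(image_pattern) ?? [];
console.log(safe_result.length); // 0

// When there are matches, the fallback is simply not used.
const found = "See photo.png and chart.webp".match(image_pattern) ?? [];
console.log(found); // [ 'photo.png', 'chart.webp' ]
```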
Next up, we filter the arrays of results so that each array contains only unique results. This is a critical step. If we do not filter out duplicate results and the input text contains multiple instances of the same URL or image file name, then we break the HTML tags in the output. JavaScript does not provide a simple, built-in method to get unique items in an array that is akin to the PHP array_unique() function.
The code snippet works around this limitation using an admittedly ugly but straightforward procedural approach. The same problem can be solved using a more functional approach if you prefer. There are many articles on the web describing various ways to filter a JavaScript array in order to keep only the unique items.
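One of those more functional approaches relies on Set, which stores each value only once. This is a sketch of an alternative, not the code used in this article; unique_items() is a hypothetical helper name.

```javascript
// Hypothetical helper: a Set drops repeated values and remembers insertion
// order, so spreading it back into an array keeps the first occurrence of
// each item in its original position.
function unique_items(items) {
  return [...new Set(items)];
}

// Made-up sample data with one repeated URL
const urls = [
  "https://example.com",
  "https://example.org",
  "https://example.com",
];

console.log(unique_items(urls));
// [ 'https://example.com', 'https://example.org' ]
```

With a helper like this, each of the two de-duplication loops could be reduced to a single call.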
We're also checking if the URL is matched as an image before replacing a URL with an appropriate <a> tag, and performing the replacement only if the URL doesn't match an image. We may be able to avoid having to perform this check by using a more intricate regular expression. The example code deliberately uses regular expressions that are perhaps less precise but hopefully easier to understand in an effort to keep things as simple as possible.
And, finally, we're replacing image file names in the input text with <img> tags that have the src attribute set to the image file name. For example, my_image.png in the input is transformed into <img src='my_image.png'> in the output. We wrap each <img> tag with an <a> tag that links to the image file, so clicking the image opens the file itself.
There are a couple of benefits to this approach:
In a real-world scenario, you will likely use a CSS rule to constrain the size of the rendered image. By making the images clickable, you provide users with a convenient way to view the full-size image.
If the image is not a local file but is instead a URL to an image from a third party, this is a way to implicitly provide attribution. Ideally, you should not rely solely on this method but, instead, provide explicit attribution underneath the image in a <figcaption>, <cite>, or similar element. But if, for whatever reason, you are unable to provide explicit attribution, you are at least providing a link to the image source.
It may go without saying, but "hotlinking" images is something to avoid. Use only locally hosted images wherever possible, and provide attribution if you do not hold the copyright for them.
Before we move on to displaying the converted output, let's talk a bit about accessibility, specifically the image alt attribute. The example code I provided does add an alt attribute in the conversion but does not populate it with a value, as there is no easy way to automatically calculate what that value should be. An empty alt attribute can be acceptable if the image is considered "decorative," i.e., purely supplementary to the surrounding text. But one could argue that there is no such thing as a purely decorative image.
That said, I consider this to be a limitation of what we're building.
Displaying the Output HTML
We're at the point where we can finally work on displaying the HTML-encoded output! We've already handled all the work of converting the text, so all we really need to do now is call it:
function convert(input_string) {
  // output refers to the element with id='output'
  output.innerHTML = convert_images_and_links_to_HTML(convert_text_to_HTML(html_encode(input_string)));
}
If you would rather display the output string as raw HTML markup, use a <pre> tag as the output element instead of a <div>:
<pre id='output'></pre>
The only thing to note about this approach is that you would target the <pre> element's textContent instead of innerHTML:
function convert(input_string) {
  output.textContent = convert_images_and_links_to_HTML(convert_text_to_HTML(html_encode(input_string)));
}
Conclusion
We did it! We built the same sort of copy-paste tool that converts plain text on the spot. In this case, we've configured it so that plain text entered into a <textarea> is parsed line-by-line and encoded into HTML that we format and display inside another element.
See the Pen Convert Plain Text to HTML (PoC) [forked] by Geoff Graham.
We were even able to keep the solution fairly simple, i.e., vanilla HTML, CSS, and JavaScript, without reaching for a third-party library or framework. Does this simple solution do everything a ready-made tool like a framework can do? Absolutely not. But a solution as simple as this is often all you need: nothing more and nothing less.
As far as scaling this further, the code could be modified to POST what's entered into the <form> using a PHP script or the like. That would be a great exercise, and if you do it, please share your work with me in the comments because I'd love to check it out.